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Abstract: This paper proposes uni-orthogonal and bi-orthogonal nonnegative matrix 
factorization algorithms with robust convergence proofs. We design the algorithms based 
on the work of Lee and Seung [T|, and derive the converged versions by utilizing ideas 
from the work of Lin [2] . The experimental results confirm the theoretical guarantees of 
. the convergences. 
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1 Introduction 

> 

' The nonnegative matrix factorization (NMF) is a technique that decomposes a nonnega- 

^v^j , tive data matrix into a pair of other nonnegative matrices of lower rank: 

in 

AwBC, (1) 

where A e R+ xN = [ai, . . . ,a N ] denotes the data matrix, B G R+ xK = [bi, . . . , b K ] 
denotes the basis matrix, C £ xN = [ci, . . . , cjv] denotes the coefficient matrix, and 
K denotes the number of factors which usually is chosen so that K <C min(M, N). To 
compute B and C, usually eq. [Tjis rewritten into a minimization problem in Frobenius 
/\ ' norm criterion. 

c3 ■ min J(B,C) = -|[A-BC[|| s.t. B > 0,C > 0. (2) 

B.C 2 

Orthogonal NMFs are introduced by Ding et al. [TT] to enforce orthogonality con- 
straints on columns of B and/or rows of C in order to improve clustering capability of 
the standard NMF (we will refer NMF objective in eq. [2] as the standard NMF for the 
rest of this paper). Because clustering indicator matrices are orthogonal (hard clustering 
cases), imposing orthogonality on columns of B (rows of C) will potentially produce a 
sharper row clustering indicator matrix (column clustering indicator matrix), and there- 
fore it is expected that this mechanism will lead to better clustering methods. 

However, as the original orthogonal NMF algorithms QT] and the variants [TU [13j [14] 
are all based on the multiplicative update (MU) rules, there is no convergence guarantee 
for these algorithms (in section [5] we will explain why MU based algorithms do not have 
convergence guarantee) . And because the orthogonality constraints cannot be recast into 
alternating nonnegativity least square (ANLS) framework (see [SI HSj f° r discussion on 
ANLS), converged algorithms for the standard NMF, e.g., [201 EH M H2 UM [22] , cannot be 
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utilized for solving orthogonal NMF problems. Thus, there is still no converged algorithm 
for orthogonal NMFs. 

The proposed algorithms are designed by generalizing the work of Lin [5] in which 
he provides a converged algorithm for the standard NMF based on the additive update 
(AU) rules. The generalization presented in this chapter is not trivial since the proofs are 
developed in matrix form, thus providing a framework for developing converged algorithms 
for other NMF objectives that have matrix based auxiliary constraints with mutually 
dependency between columns and/or rows (Lin uses vector form for developing the proofs, 
so the interdependency between columns and/or rows cannot be captured). 

Also, in the process of developing the proofs, the objectives need to be decomposed 
into the Taylor series. When the objectives have only up to second order derivatives, 
then the nonincreasing properties can be proven by showing the positive-definiteness of 
the Hessians of the objectives [TJ[2]. But in general cases, the objectives can have more 
than second order derivatives. And in particular, the orthogonality constraints make the 
objectives have more than second order derivatives. Thus, the same strategy cannot be 
used for the general cases. Accordingly, we introduce a strategy to deal with this kind of 
objectives. Note that the proofs presented here are sufficiently general to be a framework 
for developing converged algorithms for other NMF objectives with well-defined partial 
derivatives up to second order. 

2 Multiplicative update algorithm 

In [T], Lee and Seung introduce two MU rules based algorithms for the standard NMF 
using the Frobenius norm and the Kullback-Leibler divergence respectively as the distance 
measure. In addition, they also show how to modify the Frobenius norm based MU 
algorithm into AU version. However, due to numerical difficulties of the Kullback-Leibler 
divergence, and computational requirements of the AU algorithm, only the Frobenius 
norm based MU algorithm is being extensively studied. In this section, we will review 
the Frobenius norm based MU algorithm and discuss the reason why this algorithm do 
not have convergence guarantee. Note that only the Frobenius norm will be considered 
for the rest of this chapter. 

First let us rewrite the standard NMF objective in eq. [2] 

minJ(B,C) = -||A-BC||£ s.t. B > 0, C > 0. (3) 

B.C 2 

The KKT function of the objective is: 

L(B, C) = J(B, C) - tr (r B B T ) - tr (T C C), 

where T B £ xi? and T c elf R are the KKT multipliers. Partial derivatives of L 
with respect to B and C can be written as: 

V B £(B) = V B J(B)-r B , and 

v c £(c) = v c J(c)-r£, 

with 

V B J(B) = BCC T - AC T , and 
Vc«/(C) = B T BC — B T A. 
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By results from optimization studies, (B*, C*) is a stationary point of eq. [3] if it satisfies 
the KKT optimality conditions |24j . i.e., 

B* > 0, C* > 0, 

v B J(B*) = r B > o, v c J(c*) = r£ > o, 

V B J(B*) ©B* = 0, V C J(C')0C*=O, (4) 

where denotes component- wise multiplications, and eq. 2] is known as the complemen- 
tary slackness. 

The MU algorithm is derived by utilizing the complementary slackness: 

(BCC T - AC T ) B = 0, 
(B T BC - B T A) C = 0. 

These equations lead to the following update rules [T]: 

, (AC T ) 

<— b ™r L CG T? r ( 5 ) 



' inr 



rn rn 



(b t a) 

(B T BC) 



Vr, n, (6) 



where k = 0, . . . , K denotes the iteration, K denotes the maximum iteration, b k yir and 
c k n denote (to, r) entry of B and (r, n) entry of C at fc-th iteration respectively. These 
equations are the MU algorithm for the standard NMF problem in eq. [3] 

Theorem 1 (Lee and Seung pQ). Objective in eq. [3] is nonincreasing under the update 
rules eq.®and® i.e., j(B k+1 ',C k+1 ) < j(B k+ \C k ) < j(B k ,C k ) Vfc > 0. 

Theorem 2 (Lin 23 ). If A. has neither zero column nor row, and B° > and C° > 0, 
then B fe > and C k > Vfc > under the update rules eq. and® 

Theorem 3. Given A, B , and C° satisfy the conditions in theorem^ if (B*,C*) is a 
stationary point on the feasible region, then the update rules eq.® and® will stop updating 
B* and C*. 

Proof. Because any stationary point satisfies the KKT conditions and B fc > and C fc > 
Vfc > 0, then by using the complementary slackness it can be shown that Vb^(B*) = 
and V C J(C*) = 0. Accordingly, AC* T = B*C*C* T and B* T A = B* T B*C*, therefore 
B fc = B* and C fc = C* Vfc > *. □ 

Theorem 4. If there exists (m,r) or (r,n) so that b l mr — or c l rn — for some I > 0, then 
when the eq. ® and ® stop updating, there is no guarantee that this point is a stationary 
point. 

Proof. If b l mr = (c l rn = 0), then = (c k n = 0) Vfc > I. Consequently, we must 
make sure that V B </(B)^ r > (V c J(C) k n > 0) Vfc > I for this point to satisfy the KKT 
conditions. When there exists fc such that this requirement is not satisfied, then there is 
no stationarity guarantee. □ 

So, while theorem[3]states that the MU algorithm can reach stationary points, theorem 
|4] gives the reason why the MU algorithm cannot guarantee to converge to the stationary 
points. 
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To avoid division by zero, the MU algorithm usually is modified into: 




TV I 



.(fe+i) 



rn 



mr 



Vr, n, 



Vm, r, 



where 5 is a small positive number. The complete MU algorithm for the standard NMF 
is given in algorithm [TJ 



Algorithm 1 The MU algorithm for the standard NMF (Lee & Seung algorithm pQ). 

Initialization, B° > and C° > 0. 
for k = 0,...,K do 



As stated in theorem 21 the initial values of B and C in algorithm [T] have to be all 
positive to avoid zero locking from the start (see, e.g., [H [53] for detailed discussion on 
zero locking phenomenon). But, as shown in theorem [2] assigning positive initialization 
will lead to solutions that lie on positive orthant of the feasible region, i.e., B fc > and 
C fe > Vfe > (at least theoretically). And consequently, the algorithm cannot find 
stationary points that lie on the boundary of the feasible region. 

Note that some literatures, e.g. [3J recommend to normalize B for each iteration 
so that the Euclidian length of each its columns is one to guarantee the uniqueness of the 
solution (and consequently, each row of C has to be adjusted accordingly to preserve the 
objective value). 

Fig. Q] shows the nonincreasing property of the algorithm [1] which is guaranteed by 
theorem [T] for Reuters4 dataset (see section IQ1 for discussion on the datasets). As the 
error, objective of the algorithm [T] (eq. [3j is used. 

3 Original Orthogonal NMF algorithms 

In [TT], Ding et al. propose two MU rules based orthogonal NMF algorithms: uni- 
orthogonal NMF and bi-orthogonal NMF. 

3.1 Uni-orthogonal NMF 

Uni-orthogonal NMF (UNMF) imposes orthogonality constraint on either columns of B 
or rows of C. We will discuss the orthogonality constraint on rows of C here. Similar 
result for B can be derived equivalently. 




end for 
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Figure 1: Error per iteration (Reuters4 dataset) of algorithm [TJ 



Objective for UNMF with orthogonality constraint on rows of C can be written as: 



minJ(B,C) = i||A-BCf F 

s.t. B > 0, C > 0, ^(CC T I) = 0. 



(7) 



The KKT function of this objective is: 



L(B, C) = J(B, C) - tr (r B B T ) - tr (T C C) + -tr (A c (CC T - I)) , (8) 

where T B 6 R^ xR , T c £ R+ xR , and A c £ R RxR arc the KKT multipliers. Instead 
of solving the three-constraint objective in eq. [3 Ding et al. 11 propose the following 
objective: 



min J(B, C) = i||A- BC||| + itr (Ac (CC T - I)) 
s.t. B > 0, C> 0. 



(9) 



Note that, even though both objectives (eq. [7] and O have the same KKT function, 
i.e., eq. [U they are not exactly the same, as the orthogonality constraint is absorbed into 
the minimization problem. 

The KKT conditions for objective in eq. |9]are: 



B* > 0, 



C* > 0. 



with 



v b J(b*) = r B > o, v c J(c*) = rg > o, 
v b J(b*) ©b* = o, v c J(c*) c* = o, 

VbJ(B) = BCC T - AC T 

V c J(C) = B T BC - B T A + A C C 
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By using the same strategy as in section [2] MU rules based UNMF algorithm can be 
written as: 

, , , (AC T ) mr (iq) 



(BCC ) mr 
(B T A) 



[(B T B + A C )C] 



(11) 



The problem with this algorithm is how to determine Ac- By summing over index r, 
Ding et al. find an exact formulation for the diagonal entries: 

(Ac) rr = (B T AC T -B T B) rr . (12) 

The off-diagonal entries are obtained by ignoring the nonnegativity constraint on C and 
by setting Vc^(C) (J in eq. to zero matrix: 

V c J(C) = -B T A + B T BC + A c C = 0, (13) 
(Ac) rs = (B T AC T - B T B) rs . Vr ? s. (14) 

Eq. [13] is derived from eq. [9] by using the fact ||X||j? = tr(A T A), and eq. [14] is derived 
from eq. Q2] by using the orthogonality constraint CC T = I. By combining eq. [T^] and 
eq. [TH Ac can be defined as: 

A c = B T AC T - B T B. (15) 

Accordingly, the UNMF algorithm can be rewritten as: 

(AC T )„ 
(BCC T ) r 



t V-™-^ Jmr t-ia\ 
Omr T^T^yT (16) 



rn rn (BTAC T C) ■ ( ' 

The complete UNMF algorithm for eq . \W\ and [T71 is given in algorithm [2J Unlike in algo- 
rithm!]] normalization will change the objective value in eq. [9]as there is tr (Ac(CC T — 
I)) component, thus it is not recommended. 

Algorithm 2 UNMF algorithm due to the work of Ding et al. [11] . 

Initialization, B° > and C° > 0. 
for jfe = 0,...,K do 



. (AC feT ) 



J mr 1 mr 

I 13 W i_ " '■ , 

' mr 



(B k C k C kT ) +5 

V / mr 



. , . (B( fc +D T A) 

rn rn (B( k + 1 ) T AC kT C k ) ^ 



end for 



Note that as there is an assumption in deriving Ac , algorithm [5] may or may not 
be minimizing the objective eq. [HI Further, the auxiliary function used by the authors 
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Figure 2: Error per iteration (Reuters4 dataset) of algorithm^ 



to prove the nonincreasing property is for UNMF algorithm in eq. \TU\ and QTJ not for 
algorithm [2l So there is no guarantee that algorithm [2] has the nonincreasing property. 
Figure [2] gives a numerical example on how algorithm [2] not only does not have the 
nonincreasing property but also fails to minimize the objective. As the error, the objective 
of UNMF (eq. |9j) is used with A c defined in eq. [15] 



3.2 Bi-orthogonal NMF 

Bi-orthogonal NMF (BNMF) puts orthogonality constraints on both columns of B and 
rows of C. Therefore it is expected that this technique can be used to simultaneously 
cluster columns and rows of A. The following objective is the BNMF objective proposed 
by Ding et al. [IT] . 

min J(B,C,S) = i||A-BSC|| 2 F (18) 
s.t. B > 0, S > 0, C > 0, — (CC T — I) = 0, — (B T B I) = 0, 

where B G R+ xP and C G M^ xN arc defined similarly as before, and S e Rf Q is a 
matrix that introduced to absorb the different scales of A, B, and C due to the strict 
orthogonality constraints on B and C. We will set P = Q for the rest of this chapter. 
The KKT function can be defined as: 

L(B, C, S) = J(B, C, S) - tr (r B B T ) - tr (r s S T ) - tr (r c C) 

+ itr (A C (CC T - I)) + itr (A B (B T B - I)), 

where r B , r c , A c , T s G M+ xQ , and A B G K+ xP are the KKT multipliers. 

An equivalent objective to eq. [18] is proposed by Ding et al. [H] to absorb the orthog- 
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onality constraints into the objective: 

mm J(B, C, S) = |||A - BSC||| + ±tr (A C (CC T - I)) 

+ itr(A B (B T B-l)) (19) 

s.t. B > 0, C > 0. 
The KKT conditions for objective in eq. [19] are: 

B* > 0, S* > 0, C* > 0, 

v B J(b*) = r B > o, v s J(s*) = r s > o, v c J(c*) = rg > o, 

V B J(B')0B*=O, V S J(S*)0S* = O, V C J(C*)0C*=O, 



with 



V B J(B) = BSCC T S T - AC T S T + BA B , 
V c J(C) = S T B T BSC - S T B T A + A C C, 
V s J(S) = B T BSCC T - B T AC T . 

Then, by using the same strategy as in section^ BNMF algorithm can be written as: 





(AC T S T 


) mp 




4 hjyip 


B(SCC T S T 4 


-A B ) 


mp 




(S T B T A) 


qn 





^qn ' u gn 



[(S^BS + A C )C] ( 

pq 



(B T AC T )j 



m M (B^BSCC T ) 

with 



PQ 



A B = B T AC T S T - SCC T S T and 
A c = S T B T AC T - S T B T BS 

are derived exactly for the diagonal entries, and approximately for off-diagonal entries by 
relaxing the nonnegativity constraints as in section f3. II 

The complete BNMF algorithm is shown in algorithm [3] And as in algorithm^ the 
normalization step is not recommended as it will change the objective value. 

Figure [3] shows error per iteration of algorithm [3] with error is the objective value in 
eq. HH As in the UNMF case, the assumptions taken for obtaining A B and Ac seem to 
be unreasonable since algorithm [3] not only does not have the nonincreasing property but 
also fails to minimize the objective value. 

4 Converged orthogonal NMF algorithms 

In this section, we will present converged algorithms for UNMF and BNMF based on the 
AU rules which have been previously shown by Lin [5] to have convergence guarantee. We 
will recast the orthogonality constraints directly into the objectives, and thus avoiding 
the necessity of absorbing them. We will show that this strategy allows us to design 
converged algorithms for UNMF and BNMF as easy as in the standard NMF case. 
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Algorithm 3 BNMF algorithm due to the work of Ding et al. [TT] . 

Initialization, B° > 0, C° > 0, and S° > 0. 
for k = 0, . . . , K do 



u mp 



(AC kT S kT ) 



' mp 



■nip 



(B fc B fcT AC fcT S feT ) 



flip 



( S fcT B (fc+l)T A ) 

V / qn 



qn fS fcT B( fc + 1 ) T AC fcT C' c ] 



Vq, n 



Jk+i) 
pq 



^ B (fe+l)T AC (fc+l)2 



'pq 



pq ( , B( fe + 1 ) T B( fe + 1 )S fc C( fc + 1 )C( fe + 1 ) T ) + 5 

\ )pq 



Vp,q 



end for 




10 20 30 40 50 6 

iteration 



70 80 90 100 



Figure 3: Error per iteration of algorithm [3] for Reuters4 datasct 



9 



4.1 Converged uni-orthogonal NMF 

We define UNMF objective in following formulation: 

mm J(B, C) = i||A - BC||| + |||CC T - I||| (20) 
s.t. B > 0, C > 0, 

with a is a constant to adjust the degree of orthogonality of C. As shown, the orthogo- 
nality constraint is recast directly into the objective, and the constraints are now similar 
to the standard NMF. 

The KKT function can be defined as: 

L(B, C) = J(B, C) - tr (r B B T ) - tr (r c C). 

And the KKT conditions are: 

B* > 0, C* > 0, 

v b j(b*) = r B > o, v c J(c*) = rg > o, (21) 

V B J(B*) ©B* = 0, V C J(C , )0C* = O ) 

with 

V B J(B) = BCC T - AC T , 

V c J(C) = B T BC - B T A + aCC T C - aC. 

Then, MU algorithm for objective in eq. HDlcan be written as: 

(AC T ) 



(BCC T ^ 



' mr 



(B T A + aC) 
™ (B^BC + aCC T C) 



1 rn 



(22) 
(23) 



The complete algorithm is given in algorithm 2J 



Algorithm 4 The MU algorithm for UNMF problem in eq. 

Initialization, B° > and C° > 0. 
for k = 0,..., K do 



fAC fcT ^ 



h( k + 1 ) j h k 1 Lniz \/m r 

mr mr (B k C k C k T) +5 ' 

V / mr 

..... , (B^^A + aC*) 



(B( fe + 1 ) T B( fe + 1 )C fe + aC k C kT C k ) 



end for 



As shown in [2] , MU algorithm can be modified into an equivalent algorithm with ro- 
bust convergence guarantee by: 1) transforming MU rules into AU rules, and 2) replacing 
zero entries that do not satisfy the KKT conditions with small positive number to escape 
the zero locking. We will employ this strategy to derive converged algorithms for UNMF. 
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AU version of the algorithm in eq. \TI\ and flZTTI can be defined as: 

b rnr i — b?nr / ^B^f^Jmrj 

BLC ) 

V / mr 

Crn ^ Crn ~ (B^BC + aCC T C) VcJ(C) - 



' rn 



As shown, this algorithm is equivalent to the algorithm in eq. l22l and l23l By inspection, it 
is clear that this algorithm inherits the zero locking phenomenon (when VB</(B) mr < 
& b mr = 0; or when Vc</(C) rn < & c rn — 0) from its MU version. Therefore a strategy 
to escape it must be introduced. Algorithm [5] gives the necessary modifications to avoid 
the zero locking, where 

if V B J(B fc ,C fc ) >0 
if V B J(B fc ,C fc )^<0 ' 

1 fV c J(B( fc + 1 ),C fe ) rn >0 
if VcJ(B( fe+1 ),C fe ) <0' [ ' 

V 7 / rn 

are the modifications to avoid the zero locking with a is a small positive number, B and 
C are matrices that contain b mr and c rn respectively, and 

V B J(B fc , C fc ) = B fe C fc C feT - AC kT , 

V c J(B k+ \C k ) = B( fe+1 ) T B( fc+1 'C fc - B< fc+1 ' T A + aC k C kT C k - aC k . 



b k =\ 


r b k 

i ^rnr 


mr 1 


L max^^o-; 






c k =\ 


r c k 

^rn 


^rn — 1 


[ max(c£„,cr) 



Algorithm 5 The AU algorithm for UNMF problem in eq. 

Initialization, B° > and C° > 0. 
for k = 0, . . . , K do 



. (fe+l) ik _ bmr X Vb J(B fc , C fc ) mr 

mr <~ mr (B k C k C kT ) +5 



Vm,r (26) 



Jk+l) , „k C rn X VcJ(B k+1 ,C k ) rn 

™ ( B (fc+i)T B (fc+i)cfe +a c fe C fcT C fc ) + J* ' ' 1 ' 

V / rn *~* 



end for 



Note that as algorithm [5] is free from the zero locking, B° and C° can be initialized 
with nonnegative matrices. Theorem [5] explains this formally. Also, we have <5£, in eq. 1271 
So it is no longer a constant, but a variable that may be different in each iteration. As will 
be explained later, 5^ plays a crucial role in guaranteeing convergence of the algorithm. 

Theorem 5. J/B° > and C° > 0, then B k > and C k > 0, Vfc > 0. And if B° > 

and C° > 0,. then B k > and C fc > 0, Vfc > 

Proof. This statement is clear for fc = 0, so we need only to prove for fc > 0. 
Case 1: VeJmr > =>■ 6 mr = 6 mr . 

, = (B*C*C"-) mr ft£, r + ^ r _ (B k C k C k ^ mr b k mr -(AC kT ) mr b k mr 
( B k C k C kT\ +d (B k C k C kT ) +5 

\ J mr V / mr 



l(^ kT )mr + ^ k 



mr J mr 



(B k C k C kT ) + S 

V / mr 



n 



Thus, if b k mr > then 6^+ 1} > Vm, r, and if b k nr > then btt 1] > Vm, r, Vfc > 0. 
Case 5: Vb 4r < b mr ^ 6 mr . 



h (k+i) =h k max(fe^ r ,(7)VBJ(B fc ,C ;> „, 

( B fc C fc C feT) + ( j 



Note that max (&£„., a) > and V B J(B fc , C k ) mr < 0. Thus if b k mr > then b&? 1) > 
Vm, r, and if b k nr > then 6^+ 1} > Vm, r, Vfc > 0. 
Case 3: Vc Jrn > c rn = c rn . 

( B (fe+i)T B (fc+i) C fe + aC fe C fcT C fc ) rn c^ + 5 k c c k n 



c (fc+i 



( B ( fe +i)T B (fc+i) C fc + a c fe c fe,r c fe ) r + 5, 



'c 



( B (fe+l)T B (fc+l) C fc + aC k C kT C k ) C k 
V / rn rn 

( B (fe+i)T B (fe+i) C fe + a C k C kT C k ) rn + 5 k c 



fB( fe + x ) T A + aC k ) c 



k 

rn 



( B (fe+i)T B (fe+i) C fc + a c fe C fcT C fc ) + 6 k 

V / rn v-' 



[{B(^)T A + a c k ) rn + 5l 



rn 
k] k 
rn ■ -Cj L ™ 



( B (/c+l)T B (fc+l) C fc + a C k C kT C k )_ + <S* ' 



' rn 



Thus if c k n > then c^+ 1} > 0, and if c k n > Vr, n then c^ +1) > Vr, n, Vfc > 0. 
Case 4- VcJ™ < => c rn =/= c rn . 

< k+1)=rk m^(c k n ,a)V c j(B^\C k ) rn 

rn ( B (k+l)T B (k+l) C k +aC k C kT C k\ +S k' 

V / rn 

Note that max (c£ n ,cr) > and V c J(B (k+1 \ C k ) rn < 0. Thus if c k n > then c r h n +1) > 

Vr, n, and if c k n > then c^ +1) > Vr, n, Vfc > 0. 

By combining results for fc = and fc > in case 1-4, the proof is completed. □ 



4.1.1 Convergence analysis 

To analyze convergence property of algorithm^ the nonincreasing property will be shown 
first as it is the necessary condition for the convergence. Because the algorithm solves the 
problem in alternating fashion, i.e., fixing one variable while solving the other, sequence 
j(B fc ) and J(C k ) can be analyzed separately. Thus, by showing that: 

j( B (fe+i)) < j(B k ) and (28) 
j(C (fc+1) ) < J(C fe ), Vfc > 0, (29) 

the nonincreasing property of algorithm [SJ i.e., j(B( fc+1 ) ,C( fc+1 )) < j(B( fc+1 ^ ,C fc ) < 
j(B fe ,C fe ), will b e proven. 

A. The nonincreasing property of j(B fe ) The nonincreasing property of sequence 
j(B fe ) of algorithm l5l (cq. [28]) has been proven by Lin [2]. Here we will describe his proof 
in accord to our more general approach. 
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So far, there is no method to directly prove j(B^ fe+1 ^) < j(B fc ). Fortunately, the 
auxiliary function approach [I] can be utilized as an intermediate function: 

j( B (fc+D) = G (B( fc+1 ),B( fc+1 )) < G(B^ k+1 \B k ) < G(B k ,B k ) = j(B k ). 
To define G, let's rearrange B into: 



where b m is the m-th row of B. And also let's define: 



V s r5(58 fcT ) 



V B 3(B fe ); 



v B a(B fe ); 



v B a(B fc ) 



„MRxM 



where V B 3(B fcN( 



is the m-th row of V B J{B k ) = B fc C fe C fcT - AC kT . Then define: 
D = diag(D\...,D M ) mMRxMR 



where D m is a diagonal matrix with its diagonal entries defined as: 

(B k C k C kT ) +S 



if r € X„ 
if r ^ I„ 



with 



tr {(5B-9S fc )V«8^(Q5 feT )} 
£tr {(<8-<8 fc )D(«8-«8 fc ) T }. 



(30) 



X m = {r|6^ r > 0, V B J(B fc ) mr + 0, or 
C = 0, V B J(B fc ) mr < 0} 
is the set of non-KKT indices in m-th row of B fe , and * is defined so that * = and 

*- x = o. 

Then, the auxiliary function can be defined as: 

1 

Note that 3 and 25 are equivalent to J and G with B is rearranged into 25 T , and other 
parameters are reordered accordingly (one can still use J and G, but it won't be as 
compact as our approach), and also whenever X( fc+1 ) is a variable, we remove (fc + 1) 
sign. And: 

V S T©( s B T , s 8 fcT ) = D(Q5 - <B k ) T + V< B Ta( , B' £T ). 

By definition, D is positive definite for all B fe not satisfy the KKT conditions and positive 
semidefinite if and only if B k satisfies the KKT conditions. Thus <S(Q3 T , s B feT ) is a strict 
convex function, and consequently has a unique minimum, so that: 



D(<8 - %> K y + V< 8 Ta( , B' £T ) = 0, 
Q3 T = <8 fcT - D- 1 V !8 r3(Q3 ,!T ), 



(31) 
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which is exactly the update rule for B in eq. [251 

To obtain an alternative formulation for 3(95 T ) that in the same fashion with © 
formulation, the Taylor series expansion is used. 



3(25 T ) = 3(S fcT ) + tr { (35 - <B fc ) V^a^) } 

+ itr {(95 - <B fc )V|j(B fe ) (95 - 25 fe ) T }. 



(32) 



where 



V|J(B fe ) = 



V|J(B*) 



V|j(B fe ) 



nMRxMR 



with Vg J(B ) = C C components are arranged along its diagonal area (there are M 
components). 

Then, for 25 to be the auxiliary function, we must prove: 

1. e5(25 T ,25 T ) =a(95 T ), 

2. <5(93 fcT ,Q3 feT ) =3(95 fcT ), 

3. e5(<8 T ,Q5 T ) < <S(93 T ,Q3 fcT ), and 

4. ©(<B T ,<8 fcT ) <6(53 feT ,95 fcT ), 

so that j(95 T ) < 3(25 fcT ). Because 05 is equivalent to B with reordered rows, this implies 
j(B( fc+1 )) < j(B fe ), which is the nonincreasing property of the sequence J(B fe ). The 
first and second will be proven in theorem [6j the third in theorem [JJ and the fourth in 
theorem [8] 

Theorem 6. 0(Q5 T , <8 T ) = 3(<8 T ) and ©(<8 feT , 25 feT ) = 3(53 feT ) . 

Proof. These are obvious from the definition of © in eq. [30] □ 

Theorem 7. 0(Q3 T ,«B T ) < <S(*8 T , 93 feT ) . Moreover if and only ifB k satisfies the KKT 
conditions in eg. EH then <S( $ B T , < B T ) = C5(«B T , Q5 fcT ) . 

Proof. By substracting eq . [30l from eq. [32l we get: 



0( < B T ,Q5 fcT ) -<S(03 T ,O3 T ) = -tr{(®-«8 fc )(D- V|j(B fc )) (05 - Q5 fc ) T } 



i M 



b m -b k m )(p m -ViJ(B k ))(b m -b k m f 



If D m — V| j(B fe ) Vm are all positive definite, then the inequality always holds except 
when b m = Vm. Thus, it is sufficient to prove the positive definiteness of D m — 
k ^ Vm. 



V b J ( B 1 

Let = b m — b k n ^0, then we must prove 



Note that 



v£(D ro -V|j(B fc ))v m >0 
(b^x fe ) 



if r G X„ 
if r I„ 
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with X fe = C k C kT = V| J(B fe ) and D m are symmetric. Thus, 

v^(D- - V|j(B*))v m = + ^ 5" ~ £ «r«. 

, r _2 mr r— 1 mr , r g= j 

\ ~> 2^s=l u ' rs \' J m) s \ " \ " 
>l^ V r p 2^2^V r V s X rs 

r=l mr r=l s=l 



1 " " x (b fc 1 1 w fl a; (b k ) 

r—1 s—1 
R R 

VrVs 



2^^ r b k 2 ^ ^ s b k 

r=l s=l mr r=l s=l ms 

R R 



r=l s=l 
R R 



Ifjk hjk 



2 



where u r is the r-th entry of v m and x rs is the (r, s) entry of X. Therefore, D' m — 
Vg j(B fc ) Vm are positive definite, and consequently the equality happens if and only if 
B = B fc which by the update rule in eg. l26l and the boundedness theorem [TBI happens if 
and only if B fc satisfies the KKT conditions. □ 

Theorem 8. ©(03 T ,03 fcT ) < 0(O3 fcT , 03 fcT ) . Moreover, if and only if B satisfies the 
KKT conditions in eq.\M then <5(03 T ,03 feT ) = <S (93* T , m kT ) . 

Proof. 

<8(<8 kT ,<8 kT ) - (5(03 T ,03 feT ) =-tr {(<B-«8 fc )V !8 r5(® fcT )} 

- itr {(03 - 0S fe )D(03 - 0S fc ) T }. 

By using eq. ED and the fact that D is positive semi-definite: 

0(Q5 fcT ,«8 fcT ) - <S(03 T ,Q3 fcT ) = -tr {(03 - 03 fc )D(03 - 03 fc ) T } > 0, 

we proved that 0(O3 T ,O3 fcT ) < ©(03 fcT , 03 fcT ) . Now, let's prove the second part of the 
theorem. By the update rule eq. if B fc satisfies the KKT conditions, then B will be 
equal to B fc , and thus the equality holds. Now we need to prove that if the equality holds, 
then B fe satisfies the KKT conditions. 

To prove this, let consider a contradiction situation where the equality holds but B fc 
does not satisfy the KKT conditions. In this case, there exists at least an index (m, r) 
such that: 

h =L h k and rl m — K ' mr > 

o mr f= o mr ana a rr _ s . 

mr mr 

Note that by the definition in eq. [Ml if b k lr is equal to zero, then it satisfies the KKT 
conditions. Accordingly, b mr = b k nr which violates the condition for the contradiction. 
So, b k nr cannot be equal to zero, and thus d™ is well defined. Consequently, 

0(O3 feT , 03 feT ) - <g(03 T , 03 feT ) > ( Kir ~ fc ^ - > 0, 
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which violates the equality. Thus, it is proven that if the equality holds, then B fc satisfies 
the KKT conditions. □ 

The following theorem summarizes the nonincreasing property of j(B fe ). 

Theorem 9. j(B fe+1 ) < j(B fe ) Vfc > under update rule eq. [M\ with the equality 
happens if and only if B k satisfies the KKT conditions in eq. \21\ 



Proof. This theorem is the corollary of theorem [6j [3 and [8] 



□ 



B. The nonincreasing property of j(C fe ) Now we prove the nonincreasing property 
of J(C fc ), i.e., eq.HU j(C( fc+1 )) < j(C k ) V/c > 0. Note that to prove this, B k and C k 
must be bounded. The boundedness of B and C k will be proven in theorem 1161 

By using the auxiliary function approach, the nonincreasing property of j(C fc ) can 
be proven by showing that: 

j( G (fc+i)) = G ( C ( k+1 \c( k+ V) < G(& k+1 \C k ) < G(C k ,C k ) = j(C k ). 

To define auxiliary function G, C is rearranged into: 

Cl 



C2 



Cat 



vNRxN 



where c„ is the n-th column of C. And also let's define: 

rv C 3(c fe ) 1 



v c a(c fe ) ; 



v c a(c* 



*NRxN 



where V C 3(C fc ) is the n-th column of V C J(C fc ) = B^+^B^C* - B( fe+1 ) T A 



yC k C kT C k 



aC k . And: 



vNRxNR 



D = diag (D 1 ,...,^) G 
where D™ is a diagonal matrix with its diagonal entries defined as: 



j(*+i)r R (fc + i) j=,fc 



C fc +aC fe C fcT C fc ' 



if r e I n 
if r g l n 



with 



l n = {r\c k rn > 0, VcJ(C fc ) rn ^0, or 
c k rn = 0, V c J(C fc ) rn < 0} 

is the set of non-KKT indices in n-th column of C fc , and * is defined as before. 
Then, the auxiliary function (S can be written as: 

<8(£,£ k ) = Z(£ k ) +tr {(£-£ k ) T V € 3(£ k )} + hr {(£-£ k ) T ~D(£-£ k )}. 



(33) 
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Also: 

v c c5(e:,e: fc ) = d(c - c fe ) + v c 3(e: fc ). 

Since D here is equivalent to D in j(B fc ) case, ©(£, £ ) is a strict convex function, and 
consequently has a unique minimum, so that: 

D(C - £ fc ) + V c a(£ fc ) = 0, (34) 

er = £ fe - D- 1 v e a(e: A; ), 

which is exactly the update rule for C in eq. [571 

By using the Taylor series, alternative formulation for can be written as: 

3(e) =3(c fe ) +tr {(e:-i>: fc ) T Vea(e: fc )}+ 

itr { (<£ - £ fc ) T V 2 c J(C fc ) (<£ - <£*) } + 4 (35) 
where £q is the higher components of the Taylor series: 

4 = itr { (C - £ k ) T (6a£ k ) (C - £ fe ) T ( £ ~ OH 

^tr { (C - £ fe ) T (C - £ fe ) (6al) (C - £ fe ) T - <£*) } , 

and 

V^J(C fe ) ee 

with j(C fc ) = B( fc+1 ) T B( fc+1 ) + 3aC k C kT — al components are arranged along its 
diagonal area (there are N components). 

As before, for 25 to be the auxiliary function, we must prove: 

1. «(£,£) =3(<£), 

2. ©(£ fc ,e: fc ) =a(£ fc ), 

3. «(£,£) < ©(£,e: fe )' and 

4. ©(c,e: fe ) < ©(£ fc ,e: fc ), 

The first and second will be proven in theorem llOl the third in theorem II 11 and the fourth 
in theorem 1121 

Theorem 10. ©(£,£) =3(£), and&(£ k ,£ k ) = 3(£ fe ), 

Proof. These are obvious from the definition of © in eq. 1331 □ 

Theorem 11. Given sufficiently large Sq and the boundedness of B fc and C k , then it 
can be shown that ©((£,(£) < ©((£, C fc ). Moreover, if and only if C k satisfies the KKT 
conditions, then the equality holds. 



VhJ{C k ) 



v 2 c J(c*) 



uNRxNR 
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Proof. As ©(£,£) = 5(C), we need to show that ®(£,£ k ) - > for sufficiently 

large <5q. By substracting eq. [33] from eq. [35] we get: 

G5(£, e fc ) -5(C) = ^tr{(£-£ k f(D-V 2 c J(C k ))(£-£ k )} -s k c 

N 



{(c n -c k ) T (W-Vy(C k ))(c n -c k )} -e k c . (36) 



2 



Let v„ = c n — c k , then: 



v^(D" - j(C fc ))v„ = v^(D" + al - (B< fc+1 > T B< fc+1 > + 3aC*C fcT ))v n 

= v^(D" + ^D" + al - (B< fc+1 > T B< fc+1 > + 3aC fe C' £T ))v„, 

where D™ and S^T) 71 are diagonal matrices that summed up to D™, with 

C ( B < fc + 1 > r B<'= + 1 >C' ! + Q C' i C fcT C'=) r 1 if -cT 



if r ^Z„, " I * if r 

Accordingly, 

1 at r fl fi H ~| 

«(£, £*) - 3(C) =^ E E v2 rJ?r + S k C E <<C + « E U ™ 
n=l I. r — 1 r— 1 r— 1 J 

- i E v,T(B( fc + 1 ) T B( fe+1 ) + 3aC fc C feT )v„ - e k c . (37) 



n=l 

As shown, with the boundedness of B fe and C fe and by sufficiently large <5q, &(<£, £) < 
©(£, can be guaranteed. Next we prove that if and only if C k satisfies the KKT 
conditions, then the equality holds. 

If C fc satisfies the KKT conditions, then this is obvious by eq. 1361 regardless of Sq. 
And by eq. [371 since 5^ is a variable, the equality happens if and only if C = C k which 
by the update rule in eq. [27] and the boundedness of B fc and C k happens if and only if 
C fc satisfies the KKT conditions. This completes the proof. □ 

Note that a should not be adjusted to ensure ©(<£, C) < ©(£, £ fc ), since not only Eq 
contains a, but also a has a role in determining the orthogonality degree of C which 
should be determined from the start as a constant. 

Theorem 12. ©(£, € k ) < &(£ k ,£ k )- Moreover if and only if C k satisfies the KKT 
conditions in eq. \EH then ©((£, £ k ) = &(€ k , £ k ) . 

Proof. 

<5(£ k ,£ k ) ~<&(£,£ k ) = -tr { (C - £ k ) T V € 3(£ kT )} - hi {(£ - € k ) T T>(£ - £ k )} . 
By using eq. [34] and the fact that D is positive semi-definite: 

®(£ k ,£ k ) -<5(<t,<£ k ) = ^tr {(C-C*) T D (C-C fc )} > 0, 



18 



By the update rule eq.[571 if C k satisfies the KKT conditions, then C = C k , and therefore 
the equality holds. Now we need to prove that if the equality holds, then C k satisfies the 
KKT conditions. 

To prove this, let consider a contradiction situation where the equality holds but C k 
does not satisfy the KKT conditions. In this case, there exists at least an index (r, n) 
such that: 

k (B( k +V T B( k +^C k + aC k C kT C k ) rn + 5 k c S k c 

c rn 7^ c rn an d d„ — ~ ^ ~^T~ ■ 

Cm C rn 

Note that by the definition in eq.[25l if c k n is equal to zero, then c rn = c k n which violates 
the condition for the contradiction, so c k n cannot be equal to zero. Consequently, 

6(£ fc ,£ fc ) - 6(£,£ fc ) > (Cr " ^ > 0, 

Cm 

which violates the equality. Thus, it is proven that if the equality holds, then C k satisfies 
the KKT conditions. □ 

Theorem 13. Given sufficiently large and the boundedness o/B fe and C k , J(C' £+1 ) 
< j(C fe ) Vfc > under update rule eq. \27\ with the equality happens if and only if C k 
satisfies the KKT conditions in eq. \21[ 

Proof. This theorem is the corollary of theorem [TUl HU an d HU □ 

C. Convergence guarantee of algorithm \E\ To show the convergence of algorithm 
[5l the following statements must be proven [2]: 

1. the nonincreasing property of sequence j(B fc , C fc ), i.e., j(B (fc+1) ,C (fc+1) ) < j(B (fc+1) 
< j(B fc ,C' c ), 

2. any limit point of sequence {B fc , C fc } generated by algorithm[5]is a stationary point, 
and 

3. sequence {B fe , C fc } has at least one limit point. 

The first will be proven in theoremll4[ the second in theoremll51 and the third in theorem 
[TBI Note that satisfying the KKT conditions is sufficient for stationarity. 

Theorem 14. Given sufficiently large Sq and the boundedness o/B fe and C k , j(R( k+1 \ 
C (fc+1)) < j(B<- k+1 \ C fe ) < j(B fc 7 C fc ) under update rules in algorithm H with the 
equalities happen if and only if (B fe , C fe ) is a stationary point. 

Proof. j(B^ k+1 \C k ) < J(B fc , C k ) is due to theorem [§J with the equality happens if and 
only if B fc satisfies the KKT conditions. And for sufficiently large and the boundedness 
ofB fc and C k , J (B ( - k+1 \ C^ k +^) < j(B ( - k+1 \C k ) is due to theorem [TJ with the equality 
happens if and only if C k satisfies the KKT conditions. And by combining theorem [9J 
and [T3l algorithm [5] will stop updating sequence J(B fe , C fe ) if and only if both B fe and 
C k satisfy the KKT conditions., i.e., (B fc , C k ) is a stationary point. □ 

Theorem 15. Given sufficiently large (5q and with the boundedness of B fc and C k , it 
can be shown that any limit point of sequence {B^C*} generated by algorithm^ is a 
stationary point. 
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Proof. By theorem UM algorithm[5]produces strictly decreasing sequence J(B fc , C fc ) until 
reaching a point that satisfies the KKT conditions. Because J(B fc , C fc ) > 0, this sequence 
is bounded and thus converges. And by combining results of theorem [51 and [T51 algorithm 
E] stop updating j(B fc , C fc ) if and only if (B fc , C fe ) satisfies the KKT condit ions. And 
by update rules in algorithm [5j after a point satisfies the KKT conditions, the algorithm 
will stop updating (B fc , C fc ), i.e., B( fc+1 ) = B k and C( fc+1 ) = C k Mk > * (* is the first 
iteration where the stationarity is reached). This completes the proof. □ 

Theorem 16. Sequence {B' c ,C fc } has at least one limit point. 

Proof. As stated by Lin [2], it suffices to prove that sequence {B fc ,C fc } is in a closed 
and bounded set. The boundedness of {C fe } is clear by the objective in eq. [501 if there 
exists I such that limcj,„ — > oo, then limJ(B',C z ) — > oo > J(B°,C°) which violates 
theorem [Til And if {B fc } is not bounded, then there exists I such that \imb l mr — > oo, 

b l mr < bll^ 1 ^. Because due to theorem [HI J(B fe ,C fc ) is bounded, then c l rn Vn must be 
equal to zero. And if c l rn = Vn, then Vb J(B z , C') = Vm, so that b!^ 1 ' = b l mr Vm 
which conflicting the condition for unboundedness of B'. Thus, B z is also bounded. With 
nonnegativity guarantee from theorem [SJ it is proven that {B fc ,C fc } is in a closed and 
bounded set. □ 

Algorithm [6] shows some modifications to algorithm [5] in order to guarantee the con- 
vergence as suggested by theorem [JJ] HH and [T51 with step is a constant that determines 
how fast grows in order to satisfy the nonincreasing property. 



Algorithm 6 Converged algorithm for UNMF. 



Initialization, B° > 0, C° > 0. 
for k = 0, . . . , K do 



h (k+l) , h k b rnr X V B J(B k , C k ) mr 

° mr mr (K k C k C kT ) +5 ' 



S k c ^6 
repeat 



C ( fe+ D ^_ c fc _ £,xV c .7(B*+i,C*) rm 

rn C B (fc+l)T B (fe+l) C fc +aC fc C feT C fe) + g k - 

< — <5c x step 

until j(B( fc+1 ),C( fc + 1 )) < j(B( fe+1 ),C fe ) 
end for 



4.2 Converged bi-orthogonal NMF 

Converged algorithm for BNMF will be derived equivalently as in UNMF case. However, 
we will not cut the steps in deriving the algorithm. The readers can refer to algorithm^ 
for the final form. 
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First, let's define BNMF objective with following: 

mm J(B, C, S) = ±||A - BSC||| + |||CC T - I||| + ^\\B T B - (38) 

s.t. B > 0, C > 0, S > 0, 

with a and j3 are constants to adjust the degree of orthogonality of C and B respectively. 
The KKT function of the objective can be written as: 

L(B, C) = J(B, C) - tr (r B B T ) - tr (r s S T ) - tr (r c C) . 

And the KKT conditions are: 

B* > 0, S* > 0, C* > 0, 

v B J(b*) = r B > o, v s j(s*) = r s > o, v c J(c*) = rg > o, (39) 

V B J(B*) ©B* = 0, V S J(S*)0S* = O, v c J(c*) (DC* = 0, 

where 

V B J(B) = BSCC T S T - AC T S T + /3BB T B - /3B, 
V c J(C) = S T B T BSC - S T B T A + aCC T C - aC, 
V S J(S) = B T BSCC T - B T AC T . 

Then, the MU algorithm for objective in eq. l38l can be written as: 

(AC T S T + (3B) 

h « b - P 

mp mp (BSCC T S^ + PBB T B) mp ' 

(S T B T A + aC) 
Cqn Cqn (S^BSC + aCC r C) ' 

\ / qn 

(B T AC T ) 

^ > pq 

Spq M (B^BSCC T ) ' 

V /pq 

The complete MU algorithm is given in algorithm^ and the AU version is in algorithm 

There are c k n , and s k q in algorithm [8] which are the modifications to 6^ p , dl n , 
and Sp q to avoid the zero locking. The following gives their definitions. 

—i f b k if V B J(B fc ,S fc ,C fc ) >0 

>i maxft p) a) ifV B J(B fe ,S fc ,C fc ) mp <0 ' 



8. 



-k _ J 9" 
u qn — 



c*„ ifV c J(B(^ 1 ) | S fc J C*) (|B >0 
max(c*„,a) if V c J{B^\ S k , C k ) qn < 



k = j s k q ifV s j(B( fc + 1 ),S fc ,C( fe + 1 )) p9 >0 



S P<? 



(44) 
(45) 



max(s^, a) if V s J(B ( - k+1 \ S k , C^ 1 )) g < ' 
with er is a small positive number, B, C, and S are matrices that contain b mp , c qni and 
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Algorithm 7 The MU algorithm for BNMF problem in eq. 
Initialization, B° > 0, C° > 0, and S° > 0. 
for k = 0, . . . , K do 





- b k 




c k - 















(AC feT S fcT + /3B fe ) 

vm,p 



(B fc S*C fe C fcT S fcT + ^B fc B fcT B fe ) +5 



S feT B (fe+l)T A + aQ k\ 

^ Vg,n 



( S feT B (fe+i)T B (fe+i) S fe C fc + a c fe C feT C fe ) 9n + <5 

(• B (fc+l)T AC (fc+l)T^ 



(• B (fc+l)T B (/c+l) S fe C (fe+l) C (fe+l)Ty + § 



Vp,g 



end for 



Algorithm 8 The AU algorithm for BNMF problem in eq. E3 
Initialization, B° > 0, C° > 0, and S° > 0. 
for k = 0, . . . , K do 





- 6 fe 




- c k 




^qn 


s (fe+l) <_ 


- s k 


A P9 ^ 


b pq 



t>m P x V B J(B fe ,S fc ,C fc ) mp 

Vm,p (40) 



( B fc S fc C fc C fcT S fcT + /3B fc B fcT B fe ) mp + 5 



C k X 



v c J(B fe+1 ,s fe ,c fc ; 



C S fcT B (fc+l)T B (fe+l) S A ;C fe + a c fe C fcT C fe ) + 5 k 
V / on ^ 



Vq.n (41) 



^ x V s J(B fc+1 ,S fc ,C( fc+1 )) w 
( B (fc+i)T B (fc+i)sfcc(fc+i)C( fc + 1 ) T ) p(? + <5| 

end for 



Vp,q (42) 
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s pq respectively. And: 

V B J(B fc , S fe , C fc ) = B fe S fe C fc C fcT S fcT - AC fcT S feT + (3B k B kT B k - /3B k , 

V c J(B k+1 , S fc , C fc ) = s fcT B( fe+1 ) T B( fc+1 )S fc C fe - S kT B^ T A+ 
aC k c kT G k _ aC k^ 

V s J(B fe+1 , S fc , C k+1 ) = B^ k+1)T B ik+1) S k C (k+1) C {k+1)T - B (k+1)T AC (k+1)T . 



As in subsection 14. 11 due to the zero locking, there is no convergence guarantee for 
algorithm!?] And also as in subsection 14. 11 5^, 6^, and #| in algorithm[8]are variables that 
play crucial roles in guaranteeing the convergence of the algorithm. Note that, algorithm 
[7] must be initialized with positive matrices to avoid the zero locking from the start, 
and nonnegative matrices can be used to initialize algorithm [5] The following theorem 
explains this formally. 

Theorem 17. If B° > 0, C° > 0, and S° > 0, then B k > 0, C k > 0, and S k > 

Vfc > 0. And if B° > 0, C° > 0, and S° > 0, then B k > 0, C k > 0, and S k > Vfc > 

Proof. This statement is clear for k — 0, so we need only to prove for fc > 0. 

Case 1: VeJmp > => b mp = b mp . 

„ s (B fe S fc C fe C feT S fcT + /3B fc B fcT B fc ) b k +6 k b k 
mp - ( B fc S fc C fc C fcT S fcT + pB k B kT B k ) + 5| 

( B k S k C k c kT s kT + ^k-QkT-Qk _ AC kT 8 kT _ p B k\ b k 

\ < < / mp ul P 



( B k S k C k C kT S kT + BB k B kT B k ) 

\ 1 I mp 



[(AC kT S kT +pB k ) mp + 5 k B ]b k 



Bj mp 



^ B k S k C k C kT S kT + {3B k B kT B k ) mp + 5^ ' 
Thus, if b k np > then 1] > Vm,p, and if b k mp > then b^ lp 1] > Vm,p, Vfc > 0. 
Case 2: V B ^m P < => b mp ^ 

u mp ■ 

, , max (6* ,a) x V B J(B fc , S fe , C fc ) 



m P m P (BfcS fe C fc C fcT S fcT + /3B fe B feT B fc ) mp + ^' 

Note that max (b r fe „ p , a) > and V B J(B k , S fc , C fc ) mp < 0. Thus if b k mp > then 
b { mp X) > Vm,p, and if b^ p > then b^ 1} > Vm,p, Vfc > 0. 

Case 5: V c J 9 « > =>• c g „ = c gn . 

( S fcT B (fe+l)T B (fc+l) S fc C fc + aC fe C feT C fc) c fc + jfc c k 

(fc+1) ^ / qn Q n ^ Q n 

Cqn ~ (SfcT B (fe+l)T B (A:+l) S fc C fc + a c*C fcT C fc ) + 6% 

/ S fcT B (fe+i)T B (fe+i) S fe C fe + aC fe C fcT C fe - S feT B( fc+1 ) T A - aC k ) c* 

V / qn Q n 

(SWB(Hi)r B (*+i)s*c* + aC fc C feT C fe ) ~ + 6%, 



[(S feT B( fc+1 ) T A + aC k ) n + 5 k c ]c 



k 

qn 



( S feT B (fc+i)T B (fc+i) S fc C fc + aC k C kT C k ) qn + <5£ ' 
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Thus if c k n > then c q k n +1) > Vq, n, and if c k n > then c q k n +1) > Vq, n, Vfc > 0. 
Case 4: V c J qn < =>■ c 9 „ ^ c gn . 

(fc+1) _ fc \ qn' J ^ V ' > Tgra 

9" gn ( S fcT B (fc+l)T B (fc+l) S fc C fc + aC fc C fcT C fc) n + ^fc ■ 



Note that max (c£„,ct) > and V c J(BC £+1 ), S fc , C fc ) ?n < 0. Thus if c£ n > then 
c q k „ +1) > Vg, n, and if c£„ > then c^ +1) > Vg, n, Vfc > 0. 



Case 5: V s J pg > s pq — s pq . 

, (fc+1) _ (B^+^B^^C^C^H^ + jj. 



(• B (fc+l)T B (fc+l) S fc C (fc+l) C (fc+l)T) + £fc 
^ B (fc+l)T B (fc+l)gfc C (fc+l) C (fc+l)T _ B (fc+l)T AC (fc+l)T-) s fc^ 
( B (fc+l)T B (fc+l) S fc C (fc+l) C (fc+l)T^ 7_^fc 

[(B(Hi)r AC (H 1) T ) ^ + ^ ] ^ m 

( B (fc+l)T B (fc+l) S fc C (fc+l) C (fc+l)T)^ + Jfc ' 

Thus if s k pq > then s ( pq +1) > Vp, q, and if > then s p h q +1) > Vp, g, Vfc > 0. 

Case 6: VsJ pq < => s pg 7^ c P9 . 

(fc+i) _ fc \ pq 1 / v ' ' /p? 

A p? p« ( B (fc+i)T B (fc+i)sfcc(fc+i)C( fc + 1 ) T ) p9 + <5| ' 



Note that max(s^,cr) > and V s J(BC c+1 ) , S fe , C( fe+1 )) p < 0. Thus if s k pq > then 

4, +1) > Vp, q, and if s k q > then s ( pq +1) > Vp, q, Vfc > 0. 

By combining the results for fc = and for fc > in case 1-6, the proof is completed. □ 



4.2.1 Convergence analysis 

We will now analyze convergence property of algorithm[8] As stated previously, the nonin- 
creasing property of sequence </(B fc , S k , C k ) need to be proven first as it is the necessary 
condition for the convergence. And because algorithm [8] uses alternating strategy, the 
nonincreasing property can be analyzed separately. 

A. The nonincreasing property of J(B fe ) By using the auxiliary function approach, 
the nonincreasing property of J(B fe ) can be proven through: 

j( B (fe+D) = G(B( fe+1 ),B( fc+1 )) < G(B( fc+1 \B fc ) < G(B k ,B k ) = j(B fc ). 

To define G, let's rearrange B into: 

'hi 



25 T = 



24 



where b„, is the m-th row of B. And also let's define: 



V B 3(B fc )[ 



v B a(B^ 



tiMP x M 



where V B 3(B fe ) m is the m-th row of V B J(B fc ) = B fc S fc C fe C feT S A:T -AC A:T S fe + ( 9B fc B feT B fc - 
/3B fe . Then define: 

D = diag (D 1 , . . . , D M ) e R*f PxMP , 
where D m is a diagonal matrix with its diagonal entries defined as: 



d m = 



( B k s k c k c kT s kT + f jB k B kT B k\ +g k 
V / mp ' 



if p el„ 

if pgl„ 



with 



l m = {p\b k mp > 0, V B j(B fc ) mp ^ 0, or 
b k mp = 0, V B j(B fe ) mp <0} 

is the set of non-KKT indices in m-th row of B fc , and * is defined as before. 
Then, the auxiliary function © can be defined as: 

©(03 T ,03 fcT ) =3(93 fcT ) +tr {( $ 8~Q3 fc )V< 8 Ta( < B fcT )} 



1 



tr {(<B-Q3 fe )D(?8-<8 fe ) T }. 



(46) 



Note that whenever X^ fc+1 ) is a variable, we remove the (k + 1) sign, and 
V< 8 Te5( s B T , ( B feT ) = D(03 - Q3 fe ) T + V< BT 3( s B kT ). 

By definition, D is positive definite for all B fe not satisfy the KKT conditions, so ©(03 T , < B kT ) 
is a strict convex function, and consequently has a unique minimum. 



D(<8 - 03* 



* fe ) T + v^a(» feT ) = o, 

03 T = 03 fcT - D- 1 V !S T3r(®* r ^ 



which is exactly the update rule for B fc in eq. [40] 

By using the Taylor series expansion, -3(93 T ) can also be written as: 



(47) 



3(0S T ) =3(Q3 fcT ) +tr {(03 - «8 fc )Vg,T3(«8 feT )} 



-tr {(03 - 03 fc )V B J(B fc )(03 - 03 fc ) T } +4 



B- 



(48) 



where 



4 =-tr { (95 - 03 fc ) (6£0S fcT ) (© - 03 fc ) (58 - <B fc ) T }- 
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tr {(03 - «B fc ) (23 - Q3 fc ) T (6/3l) (03 - 03 fc ) (03 - 03 fe ) T } 



25 



and 



V B J(B fe ) = 



V|J(B*) 



zMPxMP 



V|J(B fc ). 

with V| j(B fc ) = S fc C fe C fcT S feT + 3[3B kT B k - (31 components are arranged along its 
diagonal area (there are M components). 

Then, for 25 to be the auxiliary function, we must prove: 

1. 25(25 T ,25 T ) = 3(25 T ), 

2. (5(® fcT ,95 fcT ) = 3(25 fcT ), 

3. 25(25 T ,25 T ) < <g(25 T ,25 fcT ), and 

4. 25(25 T ,25 fcT ) < 25(25 fcT ,25 feT ), 

so that 3(25 T ) < 3(25 fcT ). This implies j(B ( - k+1 ^) < j(B k ). The first and second will 
be proven in theorem 1181 the third in theorem 1191 the fourth in theorem 1201 and the 
boundedness of B fe , C fe , and S fc will be proven in theorem [32] 

Theorem 18. 25(25 T ,25 T ) = 3(25 T ) and ©(25 feT , «8 feT ) = 3(25 fcT ). 

Proof. These are obvious from the definition of © in eq. [46] □ 

Theorem 19. Given sufficiently large <5g and the boundedness ofB k , C k , and S k , then 
it can be shown that 25(05,25) < 25(25,25'°). Moreover, if and only if B k satisfies the 
KKT conditions, then the equality holds. 

Proof. As 6(25, 25) = 3(25) , we need to show that 25(25, 25 fc ) - 3(25) > for sufficiently 
large S^. By substracting eq. 25] from eq. US] we get: 



1 



6(25, 25 fc ) -3(25) = -tr{(25 - 25 fc )(D- V^J(B fe ))(25 - 25 fc ) } - e 



1 M 

\ J2 [(b m - b*,)(D« - V|J(B*))(f, m - b k m f 

m—1 



(49) 



Let v„, = b m — b k , then: 



(D' 



V| J(B fc ))v m = v£(D m + /3I - (S fe C fe C feT S feT + 3(3B kT B k ))v 



= v T (D' 

v m V 



■/3I 



where D m and (J^D" 1, are diagonal matrices that summed up to D m , with 



I BSC C 



"pp — 



if p e X m and i 
if p ^ I m , 



s fe c fe c feT S feT + 3j gB fcT B fc ))v ro , 
ith 

pr- if P G Xn 



if p £ I„ 



Accordingly, 
0(25 



M r P 

m—1 \p— 1 



P 

°B / y u mp u pp 
p=l 



P=l / 



1 M 

i £ v^(S fc C fe C feT S fcT + 3/3B feT B fc )v m - e*. 



(50) 
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As shown, with the boundedness of B fe , C k , and S k and by sufficiently large 5^ , © (93, 93) < 
©(93,93 fc ) can be guaranteed. Next we prove that if and only if B fc satisfies the KKT 
conditions, then the equality holds. 

If B fe satisfies the KKT conditions, then this is obvious by eq. H^l regardless of <5g. 
And by eq. [50] since 5^ is a variable, the equality happens if and only if B = B fe , which 
by the update rule in eq. 00] happens if and only if B fe satisfies the KKT conditions. This 
completes the proof. □ 

Theorem 20. ©(93,93 fe ) < ©(OS*, 93 fe ) . Moreover if and only ifB k satisfies the KKT 
conditions in eg. [M then ©(93,93 fc ) = ©(93 fe ,93 fc ). 

Proof. 

©(93 fe ,33 fc ) -©(93,93 fc ) = -tr{(<8- $ 8 fe )V< B a( $ 8 fcT )} 

- -tr {(93 - 9S fc )D(9S - 93 fe ) T }. 

By using eq. 27] and the fact that D is positive semi-definite: 

©(93 fc ,9S fc ) - ©(93, %$ k ) = hi {(93 - 93 fc )D(93 - 93 fc ) T } > 0. 

By the update rule eq.HU] if B fc satisfies the KKT conditions, then B = B fc , and therefore 
the equality holds. Now we need to prove that if the equality holds, then B fc satisfies the 
KKT conditions. 

To prove this, let consider a contradiction situation where the equality holds but B fe 
does not satisfy the KKT conditions. In this case, there exists at least an index (m,p) 
such that: 

(B fe S fc C fe C feT S fcT + /3B fc B fcT B fc ) +6^ s k 
b mp + b mp and d pp = p > p-. 

Note that by definition in eq. S3] if b k np is equal to zero, then b mp = b k np which violates 
the condition for the contradiction, so b k cannot be equal to zero. Consequently, 



©(<8 fc , <B fc ) - ©(<8, ® fe ) > (&mp . fc b "' p)2 ^ > °. 

which violates the equality. Thus, it is proven that if the equality holds, then B fe satisfies 
the KKT conditions. □ 

Theorem 21. Given sufficiently large <5g and the boundedness of B fc , C k , and S k , 
j(B fe+1 ) < j(B fe ) Vfc > under update rule eq. \40\ with the equality happens if and 
only if B fc satisfies the KKT conditions in eq. \39l 

Proof. This theorem is the corollary of theorem [TSj [19l and [20] □ 

B. The nonincreasing property of j(C' c ) Next we prove the nonincreasing property 
of J(C fc ), i.e., J(C(' £ + 1 )) < j(C fe ) Vfe > 0. 

By using the auxiliary function approach, the nonincreasing property of j(C fe ) can 
be proven by showing that: 

j( C (fc+i)) = G ( C (' : + 1 ),c( fe + 1 )) < G(C( fe+1 ),C fc ) < G(C k ,C k ) = j(C k ). 
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To define G, C is rearranged into: 

ci 



C2 



where c„ is the n-th column of C. And also let's define: 
"Vc3(C' £ ) 1 



V c 3(£ fe ) 



-k\ _ 



v c a(c fc ) 



>N 



^NQxN 



where V c 3(C fc ) n is the ra-th column of V c J(C k ) = S kT B ( - k+1 '> T B ( - k+1 '>S k C k -S kT B ( - k+1 '> T A+ 
aC k C kT C k -aC k . And: 

D = diag(D 1 ,...,D JV ) ER^ QxNQ , 
where D" is a diagonal matrix with its diagonal entries defined as: 

/ s fcT B (fc + i)T B (fc+i) s fcgfe_ | _ Q gfcgfcrgfc , \ +g k 

& ^ if q G In 

* if q £ In 



with 



X n = {q\<* n >0, V C J(C*)^0, 
c k qn = 0, VcJ(C fe ) on <0} 



is the set of non-KKT indices in n-th column of C fc , and * is defined as before. 
Then, the auxiliary function 25 can be written as: 



25 



Also: 



(€,€ k ) ee a(£ fe ) +tr{(c-e: fc ) T v e 3(e: fc )} + itr{(G:-G:' £ ) T D((!:-(!: fc )}. 



(51) 



v c 25(e,e fc ) = D(<r-£ fc ) + v c a(£ fc ). 

Since 25 ((£, C fc ) is a strict convex function, it has a unique minimum. 

D(e-e fe ) + v c a(£ fc ) = 0, 
£ = e fe -D- 1 v £ a(e:' £ ), 



(52) 



which is exactly the update rule for C in eq. [41] 

By using the Taylor series, alternative formulation for 3(e) can be written as: 

3(e) = 3(e fc ) + tr {(e - e fc ) T v c 3(e fc ) }+ 

itr {(e-€ k ) T VhJ(C k )(€-€ k )} +e k c 



(53) 
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where Eq is the higher components of the Taylor series: 



4 = itr { (<£ - £ fc ) T (6aC fc ) (C - £ fc ) T (<£ - £ fc ) } + 

itr { (C - £ fc ) T (C - £ fc ) (6al) (C - £ fc ) T (<£ - £ fe ) } , 



and 



V^J(C fc ) 



pJVQxAfQ 



with V£. J(C fc ) = S feT B( fc + 1 ) T B( fc+1 )S fc + 3aC fc C fcT -oil components are arranged along 
its diagonal area (there are N components). 

As before, for 25 to be the auxiliary function, we must prove: 

1. =3(<£), 

2. 0(c fc ,e: fc ) =a(£ fc ), 

3. <S (<£,£) < <5(£,£ fe ), and 

4. 0(er,er fc ) < ©(£ fe ,e: fc ), 

The first and second will be proven in theorem l2"2l the third in theorcm[531 and the fourth 
in theorem [24l 



Theorem 22. ©(<£,<£:) = 3(£), and ©(e^, £ fe ) = 3(£ fc ), 
Proof. These are obvious from the definition of © in eq. 1511 



□ 



Theorem 23. Given sufficiently large Sq and the boundedness ofH k , C k , and S k , then 
it can be shown that ©(£,£) < <&(€., £ fc ). Moreover, if and only if C k satisfies the KKT 
conditions, then the equality holds. 

Proof. As ©(£,£) = 3(C), we need to show that «(£,£*) -5(C) > 0. By substracting 
eq. [51] from eq. EH we get: 

«(£, - 3f((£) = i tr {(£- £ fe ) T (D - V 2 C J(C*)) ((£ - £ fc ) } - e* 



1 w 



c n -c*)^-V 2 c J(C fe ))(c n -c* 



F k 

e c . 



(54) 



Let v„ = c„ — c n , then: 

v^(D" - VJb J(C fc ))v„ = v£(D" + c&- (S fcT B( fe + 1 ) T B( fc+1 )S fc + 3«C fe C fcT ))v„ 

= v^(D" + (5q D" + al - (S^B^+^B^S* + 3aC fc C fcT ))v„, 

where D™ and <5qD™ are diagonal matrices that summed up to D™, with 



if q G X„ and ^ 
if o ^ I„, 



if g e In 
★ if g ^ I„ 
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Accordingly, 



1 N ( Q Q Q } 

«(c t k ) - 3(c) =\ E E « +*cE + ° E <£. 

n — 1 ^ q— 1 9—1 9—1 / 

_ i ^ v T( S feT B ( fc +l)T B ( fc+ l) s fc + 3aC fc C fcT) Vn _ £ fc _ (55) 
n=l 

As shown, with the boundedness of B fe , C k , and S fc , and by sufficiently large Sq, 
©(£, £) < ©(£, can be guaranteed. Next we prove that if and only if C fc satisfies the 
KKT conditions, then the equality holds. 

If C fc satisfies the KKT conditions, then this is obvious by eq. I5"4l regardless of S^.. 
And by cq. [55J since Sq is a variable, the equality happens if and only if C = C k which 
by the update rule in eq . |4T1 happens if and only if C k satisfies the KKT conditions. This 
completes the proof. □ 

Theorem 24. ©(£, £ k ) < <&(£ k ,£ k ). Moreover if and only if C k satisfies the KKT 
conditions in eg. El then <S(£, <t k ) = <&{<t k ,€ k )- 

Proof. 

®(£ k ,£ k ) - &(£,£ k ) = -tr {(<£- £ k ) T V € 3(£ kT )} - -tr { (<t - £ k ) T B(£ - £ fc ) }. 
By using eq. [52] and the fact that D is positive semi-definite: 

&(<t k ,<t k ) -<8(<£,<£ k ) = 2 tr { ( £ - £ k ) T V(£ - £ k ) } > 0, 

By the update rule eq.SU if C k satisfies the KKT conditions, then C = C fc , and therefore 
the equality holds. Now we need to prove that if the equality holds, then C k satisfies the 
KKT conditions. 

To prove this, let consider a contradiction situation where the equality holds but C fc 
does not satisfy the KKT conditions. In this case, there exists at least an index (q, n) 
such that: 

( S fcT B (fc+i)T B (fc+i) S fc C fc +a c fc C fcT C fc ) +5 k s k 

r ^r k and d n - - qn > 

Cqn T c qn ana a qq - s . . 

^qn ^qn 

Note that by the definition in eq. 04] if c k n is equal to zero, then c qn = c k n which violates 
the condition for the contradiction, so c k cannot be equal to zero. Consequently, 



cL) 2 s k 



<5(C k ,£ k )-®{£X k ) > Kq _J L > 0, 

c qn 



which violates the equality. Thus, it is proven that if the equality holds, then C k satisfies 
the KKT conditions. □ 

Theorem 25. Given sufficiently large Sj^ and the boundedness of B fc , C , and S k , 
j(C fc+1 ) < j(C fe ) Vfc > under update rule eq. \41\ with the equality happens if and 
only if C k satisfies the KKT conditions in eq. \39i 

Proof. This theorem is the corollary of theorem [32J H31 arid □ 
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C. The nonincreasing property of j(S fe ) Next we prove the nonincreasing property 
of j(S fe ), i.e., j(S< fe+1 )) < j(S fe ) Vfc > 0. 

By using the auxiliary function approach, the nonincreasing property of j(S fc ) can be 
proven by showing that: 

j( S (k+i)) = G(S( fe+1 ),S( fc+1 >) < G(S< fc+1 >,S fc ) < G(S k ,S k ) = j(S k ). 

To define G, S is rearranged into: 

si 



6 



S2 



S Q. 



where s q is the q-th column of S. And also let's define: 



V e 3(6 fe ) = 



v s a(s fc ) 1 



Vs3(S fc ) 2 



Vs3(S fe ) ( 



where V s 3(S fe ) is the g-th column of V s J(S fc ) = B( fe+1 ) T B(' £ + 1 )S' £ C(' £ + 1 )C( fe + 1 ) T 

B (k+1)T AC (k+l)T ^ And . 

D^diag (D 1 ,...,D«)eRf xi, « 
where D 9 is a diagonal matrix with its diagonal entries defined as: 

/ B (fc + l)T B (fc + l)gfc c (fc + l) c (fc + l)T\ +S: 



d q = 



if pel, 

if p £ X q 



with 



2, = {Pl4 > 0, V S J(S fc ) p? ^ 0, or 
4=0, V s J(S fc ) <0} 



is the set of non-KKT indices in g-th column of S fe , and * is defined as before. 
Then, the auxiliary function © can be written as: 

(3(6, 6 fe ) = a(© fe )+tr{(©-6 fe ) T V e a(6 fc )} + itr{(e-6 fe ) T D(6-6 fe )}. (56) 
Also: 

v e ©(6, e k ) = r>(&-e k ) + v e a(6 fe ). 

Since & (&, & k ) is a strict convex function, it has a unique minimum. 

D(6-6 fe ) + V©3(6 fe ) =0, (57) 
© = e k -D- 1 V 6 3(6 fe ), 
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which is exactly the update rule for S in eq. S2J 

By using the Taylor series, alternative formulation for 3(6) can be written as: 

3(6)= a(6 fc )+tr{(6-6 fc ) T Ve3(6 fe )} + itr{(6-6 fe ) T VlJ(S fc )(©-©' £ )} (58) 



where 



vlJ(s fe ) 



ViJ(S* 



vgj(s*). 



t,PQxPQ 



with V|j(S fe ) = B( fc+1 ) T B(' c+1 )C( fe + 1 )C( fc+1 ) T components are arranged along its diag- 
onal area (there are Q components). 

For 25 to be the auxiliary function, we must prove: 

1. 0(6,©) =3(6), 

2. ©(6 fe ,6 fe ) = 3(6 fe ), 

3. 0(6,6) < 0(6, 6 fe ), and 

4. 0(6, 6 fe ) < 0(6 fc ,6 fc ), 

The first and second will be proven in theorem l26l the third in theorem \27\ and the fourth 
in theorem [23 



Theorem 26. 0(6,6) =3(6), <md©(6 fe ,6 fe ) =3(6 fc ) ; 
Proof. These are obvious from the definition of © in eq. 1561 



□ 



Theorem 27. Given sufficiently large <5| and the boundedness o/B fc , C k , and S k , then 
it can be shown that ©(6,6) < ©(6, 6 fc ) . Moreover, if and only ifS k satisfies the KKT 
conditions, then the equality holds. 

Proof. As ©(6, 6) = 3(6), we need to show that ©(6, 6 fe ) -3(6) > 0. By substracting 
eq. [56] from eq. [53 we get: 



©(6, e k ) - 3(6) = £ tr {(6 - 6 fc ) T (D - V| J(S fe )) (6 - 6 fc ) } 



-^E[K-^) T (D 9 -VU(S fc ))(s 9 -s 



9=1 



(59) 



Let v, 



q — s q — s*, then: 



v^(D« - VlJ(S fc ))v g = v^(D 9 - ( B ( fc + 1 ) T B( fc+1 )c(' £ + 1 )c( fc + 1 ) T ))v 9 

= v^(D« + <y§6« - (B( fe+1 ) T B(' £ + 1 )c( fc + 1 )c(' £ + 1 ) T ))v 9 , 

where D 9 and <5gD 9 are diagonal matrices that summed up to D 9 , with 



d q = 



^ B (fc + l)T B (fc + l)gfc c (fc + l) c (fc + l)T^ 



if p G 2, and ^ = 

if p ^ 2g, 



^ if P G X 9 

★ if P ^ Zg. 
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Accordingly, 

©(6, e fe ) - a(e) =^E|E ^< + *§ E <A 

q=l yp=l p=l J 

1 Q 

^v^B^+^B^C^C^^jv,. (60) 



9=1 

As shown, with the boundedness of B fe , C , and S fe , and by sufficiently large 6g, <&(&,&) < 
©(S,6 fc ) can be guaranteed. Next we prove that if and only if S fe satisfies the KKT 
conditions, then the equality holds. 

If S fc satisfies the KKT conditions, then this is obvious by eq. [SHI regardless of <5§. 
And by eq. [60] since <5§ is a variable, the equality happens if and only if S = S fc which 
by the update rule in eq. 02] happens if and only if S k satisfies the KKT conditions. This 
completes the proof. □ 

Theorem 28. &(&,& k ) < ®(6 k ,& k ). Moreover if and only if S k satisfies the KKT 
conditions in eq.\M then & (&,& k ) = &(6 k 1 & k ). 

Proof. 

<&(e k ,& k ) -&(&,& k ) = -tr {(e-e fe ) T v e a(6 fcT )} - -tr {(©-6 fe ) T D(6-e fe )}. 

By using eq. 1571 and the fact that D is positive semi-definite: 

<5(e k ,& k ) -0(6,e fe ) = -tr {(6-e fe ) T D(6-6 fc )} > 0, 

By the update rule eq. 021 if S k satisfies the KKT conditions, then S = S fc , and therefore 
the equality holds. Now we need to prove that if the equality holds, then S k satisfies the 
KKT conditions. 

To prove this, let consider a contradiction situation where the equality holds but S 
does not satisfy the KKT conditions. In this case, there exists at least an index (p, q) 
such that: 



( B (fc+i)T B (fc+i)sfc C (fc+i) C (fc+im +S k k 

s s fc and d'i - - Pq > 



Note that by the definition in eq. [45J if s k q is equal to zero, then s pq = s k q which violates 
the condition for the contradiction, so s k q cannot be equal to zero. Consequently, 

©(6 fe , & k ) - 0(e, & k ) > ~_ s p 6 -i > o, 

S pq 

which violates the equality. Thus, it is proven that if the equality holds, then S k satisfies 
the KKT conditions. □ 

Theorem 29. Given sufficiently large i5| and the boundedness of Ti k , C k , and S k , 
j(S k+1 ) < J(S k ) V/c > under update rule eq. \42\ with the equality happens if and 
only if S k satisfies the KKT conditions in eq. \39i 

Proof. This theorem is the corollary of theorem [511 and [25] □ 
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D. The convergence guarantee of algorithm [8] To shown the convergence of algo- 
rithm [51 the following statements must be proven: 

1. the nonincreasing property of sequence J(B k , S fe , C fc ), i.e., j(B( fc+1 ), S( fc+1 ), 
C( fe+1 )) < j(B< fe + 1 ), S fc , C( fe+1 )) < j(B( fe+1 ), S fc , C fe ) < j(B k , S fc , C fc ), 

2. any limit point of the sequence {B fc , S k , C fe } generated by algorithm [5] is a sta- 
tionary point, and 

3. the sequence {B fc , S fc , C fe } has at least one limit point. 

The first will be proven in theorem l301 the second in theorem l31[ and the third in theorem 

[221 

Theorem 30. Given sufficiently large 6^, Sq, and <5|, and the boundedness ofB k , C k , 
andS k , j(B( fe+1 ), S( fc+1 ', C< fc+1 )) < j(B^ k+1 \ S k , C( fc+1 )) < j(B ( - k+1 \ S k , C k ) < 
j(B k , S k , C fc ) under update rules in algorithm^ with the equalities happen if and only 
if (B k , S k , C fc ) is a stationary point. 

Proof. j(B ( - k+1 \ S fe , C fe ) < j(B fc , S fc , C k ) is due to theorem [21] with the equality hap- 
pens if and only if B fc satisfies the KKT conditions. j(B^ k+1 \ S fc , C( fc+1 )) < j(B^ k+1 \ 
S k , C fe ) is due to theorem [231 with the equality happens if and only if C k satisfies the 
KKT conditions. And j(B^ k+1 \ S( fe+1 \ C( fe+1 )) < j(B^ k+1 \ S fe , C^ 1 )) is due to the- 
orem [29] with the equality happens if and only if S fc satisfies the KKT conditions. And 
by combining theorem |2"T1 [231 and 1231 algorithm [5] will stop updating sequence J(B k , S k , 
C k ) if and only if B fc , C k , and S fc satisfy the KKT conditions, i.e., (B fc , S fe , C k ) is a 
stationary point. □ 

Theorem 31. Given sufficiently large 5^, S 1 ^, and 5$, and the boundedness of B k , C k , 
and S k , it can be shown that any limit point of sequence |B fc ,S fc ,C fe } generated by 
algorithm \^is a stationary point. 

Proof. By theorem [331 algorithm [3] produces strictly decreasing sequence J(B k , S k , C fc ) 
until reaching a point that satisfies the KKT conditions. Because J(B k , S k , C fc ) > 
0, this sequence is bounded and thus converges. And by combining the results of the 
theorem Ell [251 [291 algorithm [8] will stop updating J(B k , S fc , C fc ) if and only if (B fc , S fc , 
C fc ) satisfies the KKT conditions. And by the update rules in algorithm [SI after a point 
satisfies the KKT conditions, the algorithm will stop updating (B fc , S fc , C fe ), i.e., B^ fc+1 ^ 
= B fe , C( fc+1 ) = C fc , and S( fc+1 ) = S fc Vfc > *. This completes the proof. □ 

Theorem 32. The sequence {B k ,S k ,C k } has at least one limit point. 

Proof. It suffices to prove that sequence {B fc ,S fc ,C fe } is in a closed and bounded set. 
The boundedness of {B fc } and {C fe } are clear by the objective in eq. [38j if there exists I 
such that lim b l mp — > oo or lim c qn — > oo, then lim J — > oo > J(B°, S°, C°) which violates 
theorem l30l And if {S fc } is not bounded, then there exists I such that lims^ — > oo, 
s pq < s pq- Because due to theorem [331 J(B k , S fc , C k ) is bounded, then either b l mp Vra 
or cL, Vn must be equal to zero. If 6' = Vm, then Vs J pq = \/q, so that s p l q = s l pq . 
And if c l qn — Vn, then VsJ P9 = Vp, so that Sp^ 1 ^ = s l pq . Both cases contradict the 
condition for unboundedness of S l . Thus, S l is also bounded. 

With the nonnegativity guarantee from theorem [T71 it is proven that {B fc , S fc , C fe } 
is in a closed and bounded set. □ 



34 



Algorithm IHl shows modifications to algorithm [S] in order to guarantee the convergence 
as suggested by theorem E01 ED an d EH with step is a constant that determine how fast 
, Sq, and 5| grow in order to satisfies the nonincreasing property. Note that we set the 
same step value for all sequences, but setting different values can also be employed. 

Algorithm 9 Converged algorithm for BNMF problem 
Initialization, B° > 0, C° > 0, and S° > 0. 
for k = 0, . . . , K do 

repeat 

u{k+l) , Uk b rnp X VnJ(B k ,S k ,C h ) mp 



m P mp ( B fc S fc C fc C fcT S fcT +/3B fe B' £ ' r B fc ) mp + ^ 

(5g < — <5g x step 

until j(B ( - k+1 \S k ,C k ) < j(B fc ,S fe ,C fe ) 

5% ^8 
repeat 

Jk+i) , r k Cgn x Vc I /(B fc + 1 ,S^,C fc ) 9 » 

qn qn (SkT B (k+l)T B (k+l) S k C k +aC kCkTCk) + tf* 



C 



5 C < — <5c x step 
until j(B( fe + 1 ),S fc ,C( fe + 1 )) < j(B( fc+1 ),S fc ,C fc ) 



repeat 



(k+1) ^ k ^xV 8 J(B^,S*,C(^)) w 

P9 PI (Vl(k+l)Tvt(k+l)akr<(k+l)r'(k+l)T\ j_ Xk Vp ' q 



B (/c+l)T B (fe+l) S fe C (fe+l) C (fe+l)T) +S 

)pq 



S 



(5g < — <5g x step 

until j(B( fe + 1 ),S( fc+1 ),C( fc + 1 )) < j(B( fc+1 ),S fc ,C( fe+1 )) 
end for 



5 Experimental Results 

Experiments are conducted to analyze and compare properties and performances of al- 
gorithm [T] (LS), algorithm [2] (D-U), algorithm El (D-B), algorithm H (MU-U) , algorithm 
[U(AU-U), algorithm [7] (MU-B), and algorithm M (AU-B). Here, LS is used as the bench- 
mark. All algorithms are developed in Octave under linux platform, and the experiments 
are conducted by using a notebook with 1.86 GHz Intel processor and 2 GB RAM. 
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Table 1: Statistics of the test datasets 



The data 


#doc 


#word 


%nnz 


max 


min 


Reuters2 


6090 


8547 


0.363 


3874 


2216 


Reuters4 


6797 


9900 


0.353 


3874 


333 


Reuters6 


7354 


10319 


0.347 


3874 


269 


Reuters8 


7644 


10596 


0.340 


3874 


144 


Reuters 10 


7887 


10930 


0.336 


3874 


114 


Reuters 12 


8052 


11172 


0.333 


3874 


75 



Table 2: Sizes of the top 12 topics 



class 


1 


2 


3 


4 


5 


6 


#doc 


3874 


2216 


374 


333 


288 


269 


class 


7 


8 


9 


10 


11 


12 


#doc 


146 


144 


129 


114 


90 


75 



5.1 The datasets 

To evaluate the algorithms, we use the Reuters-21578 data corpus^, a standard dataset 
for testing learning algorithms and other text-based processing methods. The dataset 
is especially interesting because many NMF-based clustering methods are tested using 
it, e.g., [Till El [H] ■ The Reuters-21578 contains 21578 documents with 135 topics class 
created manually with each document is assigned to one or more topics based on its 
content. The Reuters-21578 are available in two formats: SGML and XML version. The 
dataset is divided into 22 files with each file contains 1000 documents and the last file 
contains 578 documents. 

In this experiments, we use the XML version. We use all but the 18 th file because 
this file is invalid both in its SGML and XML version. We use only documents that 
belong to exclusively one class (we use "classes" for refeering the original grouping, and 
"clusters" for referring groups resulted from the clustering algorithms). Further, we 
remove the common English stop wordfd, and then stem the remaining words by using 
Porter stemmer |25j and remove words that belong to only one documents. And also, we 
normalize the term-by document matrix A by: A AD~ 1//2 where D = diag(A T Ae) 
as suggested by Xu et al. [19]. We form test datasets by combining top 2, 4, 6, 8, 10, 
and 12 classes from the corpus. Table Q] summarizes the statistics of these test datasets, 
where #doc, #word, %nnz, max, and min refer to number of document, number of word, 
percentage of nonzero entry, maximum cluster size, and minimum cluster size respectively. 
And table [5] gives sizes (#doc) of these top 12 classes. 

5.2 The nonincreasing property 

The nonincreasing property, even though does not guarantee the convergence, is still 
a very important property since usually good results can be achieved by having this 
property. Moreover, unlike the stationarity, this property is easy to evaluate. Here we 
will show that while MU-U and MU-B — which do not have convergence guarantee — fail 

1 http: / /kdd. ics.uci.edu/databases/reuters21578/reuters21578. html 
2 http: / /snowball. tartarus.org/algorithms/english/stop.txt 
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Table 3: Time comparison (seconds) for Reuters4 dataset. 



a//3 


MU-U 


AU-U 


MU-B (a) 


AU-B(a) 


MU-B(/3) 


AU-B(/3) 


0.01 


110 


110 


121 


41.1 


122 


27.2 


0.05 


110 


110 


121 


40.9 


121 


40.7 


0.1 


109 


109 


121 


40.8 


121 


41.2 


0.3 


110 


109 


121 


40.4 


121 


41.1 


0.7 


110 


110 


121 


272 


121 


41.2 


1 


110 


110 


121 


40.8 


121 


273 


3 


110 


110 


121 


40.4 


121 


40.7 


7 


110 


110 


121 


40.4 


121 


273 


10 


110 


110 


121 


40.8 


121 


41.1 


30 


109 


110 


121 


272 


121 


442 


70 


109 


137 


121 


332 


121 


525 


100 


110 


232 


121 


382 


121 


605 


300 


110 


232 


121 


514 


121 


579 


700 


110 


461 


121 


607 


121 


606 


1000 


110 


411 


121 


606 


121 


365 



to show this property for large a and/or (3, AU-U and AU-B — which have convergence 
guarantee — can consistently achieve the desired results even for large a and/or j3. Note 
that, even though LS [1] has this property, it doesn't imply that other MU based algo- 
rithms will inherit it. As shown in figure [2] and 02 the original orthogonal NMF algorithms 
(D-U and D-B) which based on the MU rules do not have this property. 

Figure 0] show error per iteration produced by MU-U as a function of a. As the error, 
we use the UNMF objective in eq. HOI As shown, the nonincreasing property vanishes as 
a grows. And not only the errors are rather large, but also the algorithm seems to fail 
to settle for large a values. On the other hand, as shown in figure [5j AU-U preserves 
the nonincreasing property even for large a values (AU-U uses the same error as MU-U) . 
Interestingly, as shown in figure [5(b)] the errors for a = 300 are even smaller than the 
errors for a = 100 and a = 70. And since ^||CC T — 1|[ is part of the objective in eci.l2T)l 
the small errors for large a values in AU-U indicate that Cs produced by AU-U are much 
more row-orthogonal than those produced by MU-U. 

Figure [BHU show the equivalent results for BNMF cases. Because there are two ad- 
justable parameters, a and /?, we fix one parameter while studying the other. Figure [6] 
and [7] show the results for fixed (3—1, and figure [8] and |9] for fixed a — 1 . As in UNMF 
cases, while MU-B fails to show the nonincreasing property for large a and /3 values, 
AU-B successfully preserves this property regardless of a and (3 values. Note that we set 
S = a = 10~ 8 , and step = 10 for MU-U, AU-U, MU-B, and AU-B in all experiments. 

Bowever, there are computational tradeoff for these accuracies as for large a and/or 
(3, AU rules based algorithms are slower than their MU counterparts. Table [3] shows time 
comparisons between these algorithms for Reuters4 dataset. Note that, a or (3 appended 
to the algorithm's acronyms to tell which parameter is being varied. For example AU-B (a) 
means AU-B with fixed f3 and varied a. 

As shown in table [3l the computational times of MU algorithms practically are in- 
dependent from a and (3 values. And AU algorithms seem to become slower for some 
large a or /?. This probably because for large aor/3 values, the AU algorithms execute 
the inner iterations (shown as repeat until loops in algorithm [6] and [9]) . Also, there 
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Figure 4: MU-U error per iteration for Reuters4 dataset 
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Figure 5: AU-U error per iteration for Reuters4 dataset 
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Figure 6: MU-B(a) error per iteration for Reuters4 dataset (/3 = 1). 
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Figure 7: AU-B(a) error per iteration for Reuters4 dataset (/3 = 1). 
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Figure 9: AU-B(/3) error per iteration for Reuters4 dataset (a = 1). 
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Table 4: #iter and #initer of AU algorithms (Reuters4). 



a/0 


AU-U 


AU-B(a) 


AU-B(/3) 




#iter / ^initer 


#itcr / #initer 


#iter / $dniter 


0.01 


20/0 


3/0 


2/0 


0.05 


20/0 


3/0 


3/0 


0.1 


20/0 


3/0 


3/0 


0.3 


20/0 


3/0 


3/0 


0.7 


20/0 


20/0 


3/0 


1 


20/0 


3/0 


20/0 


3 


20/0 


3/0 


3/0 


7 


20/0 


3/0 


20/0 


10 


20/0 


3/0 


3/0 


30 


20/0 


20/0 


20 / 44 


70 


20/7 


20 / 23 


20 / 66 


100 


20 / 32 


20 / 22 


20 / 88 


300 


20 / 32 


20 / 65 


20 / 81 


700 


20 / 92 


20 / 75 


20 / 88 


1000 


20 / 79 


20 / 90 


20 / 24 



are some anomalies in the AU-B(a) and AU-B(/3) cases, where for some a or /3 values, 
execution times are unexpectedly very fast. To investigate these, we display number of 
iteration (#iter) and inner iteration (#initer) for AU algorithms in table [4] Note that 
MU algorithms reach maximum predefined number of iteration for all cases: 20 iterations. 

As shown in table [4] when AU algorithms perform worse than their MU counterparts, 
then they execute the inner iteration which happened for large a/ (3. And when AU 
algorithms perform better, then their #iter are smaller than #iter of MU algorithms 
(and the inner iteration is not executed). These explain the differences in computational 
times in table [3] 

5.3 Maximum number of iteration 

Maximum number of iteration is very crucial in MU and AU algorithms since these 
algorithms are known to be very slow [3 [HI El [10l [16l [18l EEl EU [22l [23] . As shown by 
Lin [23], LS is very fast to minimize the objective for some first iterations, but then tends 
to become slower. In table [SJ we display errors for some first iterations for LS, MU-U, 
AU-U, MU-B, and AU-B. We choose the cases where a = 0.1 and (3—1 since for these 
values, our algorithms are settled. Note that errorO refers to the initial error before the 
algorithms start running, and errorn is the error at n-th iteration. 

As shown in table [5] all algorithms are exceptionally very good at reducing errors in 
the first iterations. But then, the improvements are rather negligible with respect to the 
first improvements and the sizes of the datasets. Accordingly, we set maximum number 
of iteration to 20. 

5.4 Determining a and f3 

In our proposed algorithms, there are two dataset-dependent parameters, a and /3, that 
have to be learned first. Because orthogonal NMFs are introduced to improve clustering 
capability of the standard NMF [11] , these parameters will be learned based on clustering 
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Table 5: Errors for some first iterations (Reuters4). 





errorO 


error f 


error2 


error3 


error4 


error5 


LS 


f373 


0.476 


0.474 


0.472 


0.469 


0.466 


MU-U 


4652 


1.681 


1.603 


1.596 


1.591 


1.583 


AU-U 


4657 


1.681 


1.605 


1.595 


1.586 


1.573 


MU-B 


f2474 


2.164 


2.104 


2.103 


2.102 


2.102 


AU-B 


f2680 


2.137 


2.104 


2.103 







results on test dataset. We will used Reuters4 for this purpose. These parameters do 
not exist in the original orthogonal NMFs [11] nor in other orthogonal NMF algorithms 
[IHEI1H1]. However, we notice that our formulations resemble sparse NMF formulation 
[SI HI HB] , or in general case also known as constrained NMF [ 17; . As shown in ref. [5] .51 EH] , 
sparse NMF usually can give good results if a and/or /3 are rather small positive numbers. 

To determine a and /?, we evaluate clustering qualities produced by our algorithms as 
a or (3 values grow measured by the standard metrics: mutual information (MI), entropy 
(E), purity (P), and Fmeasure (F) (see section 15.6.11 for discussions on these metrics). 

As shown in figure HO] for UNMF algorithms (MU-U and AU-U) a — 0.1 seems to be 
a good choice. For MU-B it seems that a = 0.1 and (3 = 3 are acceptable settings. And 
for AU-B, a — 0.7 and (3 = 1 seem to be good settings. Based on this results, we decide to 
set a = 0.1 and (3=1 for all datasets and algorithms. Note that, other mechanisms like 
using some small samples for deriving optimal as and (3s for each dataset and algorithm 
may be a better choice since every dataset can have different characteristics. 

5.5 Times, ^iterations, and errors 

To evaluate computational performances of the algorithms, we measure their average and 
maximum running times, average and maximum ^iterations, and average and maximum 
errors produced at the last iterations for 10 trials. Table [B][5] show the results. 

As shown in the table [Ti] LS generally is the fastest with exception when MU-B or AU- 
B converge before reaching the maximum iteration (20 iterations), then these algorithms 
will outperform LS. Our uni-orthogonal algorithms (MU-U and AU-U) seem to have 
comparable running times with LS. MU-B seems to be slower for smaller datasets and 
then performs better than MU-U and AU-U for bigger datasets: ReuterslO and Reutersl2. 
Since AU-B usually converges before reaching the maximum iteration, comparison can be 
done by using maximum running times for Reuters4, Reuters6, ReuterslO, and Reutersl2 
in which the data is available (see table [7J . As shown, AU-B is the slowest to perform 
calculation per iteration. There are also abrupt changes in the running times for ReuterslO 
and Reutersl2 for all algorithms which are unfortunate since as shown in tableQ] the sizes 
of the datasets only slightly change. Figure [TT] shows the bar chart of average running 
times as the sizes of the datasets grow. 

Average and maximum errors at the last iterations are shown in table [5] Results 
for D-U and D-B support the previous results: algorithm [5] and [3] do not minimize the 
objectives that are supposed to be minimized, i.e., eg. l9l and[T9l Because only MU-U & 
AU-U and MU-B & AU-B pairs have the same objective each, we compare average errors 
for these pairs in figure [12] There is no significant difference between MU-U & AU-U in 
the average errors, but as shown in figure lll| MU-U has better average running times 
especially for larger datasets. And for MU-B & AU-B, the differences in the average errors 
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Figure 10: Clustering qualities as functions of a or (3 for Reuters4. 
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Table 6: Average and maximum running time. 



Data 


Time 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reuters2 


Av. 
Max. 


77.266 
79.031 


83.655 
84.743 


104.98 
106.25 


78.068 
79.075 


77.825 
79.176 


66.318 
83.960 


38.367 
49.477 


Reuters4 


Av. 
Max. 


108.84 
109.39 


119.42 
119.55 


152.77 
153.17 


109.04 
109.20 


109.12 
109.28 


119.46 
119.72 


86.745 
271.40 


Reuters6 


Av. 
Max. 


134.02 
134.50 


149.32 
149.62 


194.43 
194.75 


133.91 
134.27 


134.19 
134.51 


149.63 
149.95 


75.432 
327.70 


Reuters8 


Av. 
Max. 


158.37 
181.58 


173.43 
175.71 


228.59 
235.54 


153.53 
155.15 


155.03 
159.19 


173.00 
174.05 


56.464 
59.021 


ReuterslO 


Av. 
Max. 


834.69 
1004.5 


892.91 
1141.2 


911.34 
1127.3 


874.18 
1137.5 


914.93 
1162.0 


859.31 
1059.0 


601.57 
2794.1 


Reutersl2 


Av. 
Max. 


1249.2 
1389.0 


1348.4 
1590.4 


1440.1 
1746.1 


1319.7 
1565.7 


1335.6 
1529.4 


1309.0 
1506.7 


1602.4 
4172.2 



Table 7: Average and maximum ^iteration. 



Data 


#iter. 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reuters2 


Av. 


20 


20 


20 


20 


20 


16.2 


4.9 




Max. 


20 


20 


20 


20 


20 


20 


6 


Reuters4 


Av. 


20 


20 


20 


20 


20 


20 


7.2 




Max. 


20 


20 


20 


20 


20 


20 


20 


Reuters6 


Av. 


20 


20 


20 


20 


20 


20 


5.5 




Max. 


20 


20 


20 


20 


20 


20 


20 


Reuters8 


Av. 


20 


20 


20 


20 


20 


20 


4 




Max. 


20 


20 


20 


20 


20 


20 


4 


ReuterslO 


Av. 


20 


20 


20 


20 


20 


20 


5.6 




Max. 


20 


20 


20 


20 


20 


20 


20 


Reutersl2 


Av. 


20 


20 


20 


20 


20 


20 


8.8 




Max. 


20 


20 


20 


20 


20 


20 


20 



grow as the size and classes of the datasets grow with significant differences happened at 
ReuterslO and Reutersl2. However, as shown in table [71 AU-B is more likely to converge, 
so generally its running times are shorter. 

5.6 Clustering capability 

One of the prominent application of NMF is in clustering, which is reported to be better 
than the spectral clustering [15] . Especially, the orthogonal NMFs are designed to improve 
the clustering capability of the standard NMF 11 . Thus, the real assessment of the 
orthogonal NMFs qualities is in their clustering capability. 

5.6.1 The metrics 

There are some standard metrics in evaluating clustering quality. The most commonly 
used metrics are mutual information, entropy, and purity. We will use these metrics 
together with an additional metric, Fmeasure. In the following, the definitions of these 
metrics are outlined. 

Mutual information (MI) measures dependency between the clusters produced by the 
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Table 8: Average and maximum errors at the last iteration. 



Data 


#iter. 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reuters2 


Av. 
Max. 


1.3763 
1.3854 


3435.6 
3587.2 


3626.5 
3867.4 


1.4106 
1.4201 


1.4138 
1.4230 


1.7955 
1.8022 


1.8021 
1.8025 


Reuters4 


Av. 
Max. 


1.4791 
1.4855 


9152.8 
9474.9 


8689.0 
9297.9 


1.5299 
1.5408 


1.5310 
1.5402 


2.0708 
2.0880 


2.0962 
2.1028 


Reuters6 


Av. 
Max. 


1.5229 
1.5301 


17135 
17971 


15823 
16955 


1.5844 
1.5884 


1.5878 
1.5952 


2.2627 
2.2758 


2.2921 
2.2998 


Reuters8 


Av. 
Max. 


1.5434 
1.5473 


25913 
27462 


22893 
25553 


1.6215 
1.6342 


1.6171 
1.6262 


2.3863 
2.3993 


2.4421 
2.4422 


ReuterslO 


Av. 
Max. 


1.5696 
1.5801 


34154 
35236 


30518 
35152 


1.6533 
1.6662 


1.6533 
1.6618 


1.8836 
1.9529 


2.5673 
2.5718 


Reuters 12 


Av. 
Max. 


1.5727 
1.5815 


42739 
44325 


37038 
41940 


1.6620 
1.6705 


1.6621 
1.6713 


1.8860 
1.9193 


2.6551 
2.6697 
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Figure 11: Average running time comparison as the datasets grow. 
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Reuters2 Reuters4 Reuters6 Reuters8 Reuters10Reuters12 



Reuters2 Reuters4 Reuters6 Reuters8 Reuters10Reuters12 



(a) MU-U and AU-U. 



(b) MU-B and AU-B. 



Figure 12: Average errors comparison as the datascts grow. 



algorithms and the reference classes. The higher the MI, the most related the clusters 
with the classes, and therefore the better the clustering will be. It is shown that MI is 
a superior measure than purity and entropy [26] because it is tolerant to the difference 
between #cluster and #class. MI is defined with the following formula: 



MI=J2J2p(r, s) log 2 



r— 1 s—1 



P(r,s) 
p(r)p(s) 



where r and s denote the r-th cluster and s-th class respectively, p(r, s) denotes the joint 
probability distribution function of the clusters and the classes, p(r) and p(s) denote the 
marginal probability distribution functions of the clusters and the classes respectively, 
and binary logarithm is used here (other bases are also possible). Note that because of 
inconsistency in the formulation of normalized MI (a more commonly used metric) in the 
literatures, we use MI instead. Accordingly, MPs values are comparable only for the same 
dataset. 

Entropy addresses the composition of classes in a cluster. It measures uncertainty in 
the cluster, thus the lower the entropy, the better the clustering will be. Unlike MI, if 
there is discrepancy between ^cluster and #class, entropy won't be very indicative about 
the the clustering quality. Entropy is defined with the following: 



entropy 



1 



oz r=l s=l 



R S 

Crs l0g 2 — , 



where N is the number of samples (#doc for document clustering), c rs denotes the number 
of samples in r-th cluster that belong to s-th class, and c r denotes the size of r-th cluster. 

Purity is the most commonly used metric. It measures the percentage of the dominant 
class in a cluster, so the high the better. As in entropy, purity is also sensitive to the 
discrepancy between ^cluster and ^class. Purity is defined with: 

1 * 

purity — — maxc rs . 
r=l 



49 



Table 9: Average mutual information over 10 trials (document clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Rcuters2 


0.40392 


0.42487 


0.36560 


0.47507 


0.42150 


0.057799 


0.00087646 


Reuters4 


0.62879 


0.61723 


0.48007 


0.65080 


0.63640 


0.32142 


0.072621 


Reuters6 


0.79459 


0.81831 


0.52498 


0.81811 


0.82425 


0.37924 


0.078201 


Rcuters8 


0.92285 


0.90260 


0.54534 


0.94165 


0.92720 


0.48435 


0.013518 


ReuterslO 


1.0415 


1.0275 


0.62125 


1.0063 


1.0138 


0.50980 


0.072014 


Reuters 12 


1.1326 


1.0865 


0.58469 


1.1195 


1.0821 


0.47697 


0.16389 


Average 


0.82071 


0.81283 


0.52032 


0.83523 


0.81754 


0.37160 


0.066853 



Table 10: Average entropy over 10 trials (document clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reuters2 


0.54193 


0.52098 


0.58025 


0.47078 


0.52435 


0.88805 


0.94498 


Reuters4 


0.40202 


0.40780 


0.47638 


0.39102 


0.39822 


0.55571 


0.68011 


Reuters6 


0.38391 


0.37473 


0.48821 


0.37481 


0.37243 


0.54459 


0.66105 


Reuters8 


0.35568 


0.36242 


0.48151 


0.34941 


0.35423 


0.50184 


0.65879 


ReuterslO 


0.33601 


0.34023 


0.46253 


0.34661 


0.34434 


0.49608 


0.62786 


Reuters 12 


0.31953 


0.33239 


0.47236 


0.32319 


0.33362 


0.50241 


0.58974 


Average 


0.38985 


0.389760 


0.49354 


0.37597 


0.38787 


0.58145 


0.69375 



And Fmeasure combines two concept in IR: recall and precision. Recall measures the 
proportion of the retrieved relevant documents to all relevant documents, and precision 
measures the proportion of the retrieved relevant documents to all retrieved documents. 
In the context of assessing clustering quality, Fmeasure is defined with [27] : 



Fmeasure 



1 R 



precision r x recall r 
R ^— ' ' ' precision r + recall r 

r=l 



where precision,, and recall r denote the precision and recall of r-th cluster. 



5.6.2 Document clustering 

The results of document clustering are shown in table l9rfT2l In average, MU-U gives 
the best performances in all metrics especially for datasets with small ^clusters. Then 
followed by LS, AU-U, and D-U with small margins. LS seems to be better for datasets 
with large ^clusters. Generally, MU-U, LS, AU-U and D-U can give consistent results for 
variety ^clusters, but unfortunately this is not the case for D-B, MU-B and AU-B which 
are all bi-orthogonal NMF algorithms. AU-B especially seems to offer only slightly better 
clustering than random results. Note that even though there are adjustable parameters 
in MU-B and AU-B, it is unlikely that the poor results are due to these parameters. 



5.6.3 Word clustering 

In some cases, the ability of clustering methods to simultaneously group similar documents 
with related words (co-clustering) is a concern. And because the original bi-orthogonal 
NMF is designed to have this ability [TT] , we will also investigate the quality of word clus- 
tering (in the context of co-clustering) produced by all algorithms. Since word clustering 
has no reference class, we adopt idea from ref. |11] in which reference classes are created 
by using word frequencies: each word is assigned to class with the highest frequency. 
Table [T3l - [T51 show the results. 
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Table 11: Average purity over 10 trials (document clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reuters2 


0.82154 


0.83599 


0.80452 


0.85089 


0.82507 


0.66102 


0.63612 


Reuters4 


0.79417 


0.78023 


0.73778 


0.80400 


0.79704 


0.70119 


0.59657 


Reuters6 


0.74510 


0.75158 


0.68844 


0.74868 


0.75069 


0.66433 


0.54569 


Reuters8 


0.74906 


0.73982 


0.66536 


0.74869 


0.73987 


0.65033 


0.50680 


Reuters 10 


0.73120 


0.73762 


0.64845 


0.72813 


0.73330 


0.63194 


0.50639 


Reuters 12 


0.73877 


0.72719 


0.62223 


0.74127 


0.72340 


0.60118 


0.52019 


Average 


0.76331 


0.76207 


0.69446 


0.77028 


0.76156 


0.65166 


0.55196 



Table 12: Average Fmeasure over 10 trials (document clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reuters2 


0.81904 


0.83234 


0.79163 


0.84823 


0.82241 


0.58237 


0.50399 


Reuters4 


0.56154 


0.53754 


0.44352 


0.57989 


0.54267 


0.36917 


0.24585 


Reuters6 


0.46225 


0.47714 


0.33910 


0.48444 


0.47270 


0.26372 


0.17171 


Reuters8 


0.40408 


0.40554 


0.25052 


0.41822 


0.42996 


0.23904 


0.10869 


Reuters 10 


0.38001 


0.38041 


0.23309 


0.36923 


0.35947 


0.19552 


0.094912 


Reuters 12 


0.35671 


0.35811 


0.17387 


0.35214 


0.34435 


0.16401 


0.099949 


Average 


0.49727 


0.49851 


0.37196 


0.50869 


0.49526 


0.30231 


0.20418 



As shown in table [T3riT6l D-U has the best overall results followed by LS, MU-U and 
AU-U by small margins. MU-U is especially good for small ^clusters and LS is good 
for large #clusters. But unfortunately, all bi-orthogonal NMF algorithms, D-B, MU-B, 
and AU-B, which designed to accomodate co-clustering task, seem to have poor results. 
These results are in accord with document clustering cases where bi-orthogonal NMFs 
also perform poorly. 

6 Conclusions 

We have presented orthogonal NMF algorithms based on the additive update rules with 
rigorous convergence proofs. There are two versions of the converged algorithms: AU-U 
for uni-orthogonal NMF, and AU-B for bi-orthogonal NMF with their respective multi- 
plicative update rules versions: MU-U and MU-B. 

The only way to numerically evaluate whether the algorithm has converged to a sta- 
tionary point is to check whether it has satisfied the KKT conditions on that point. While 
the nonnegativity conditions are easy to check, the complementary slackness conditions 
are hard since we must check Vx</(X fe ) 0X l =0 Vfc > *. Not only there are some large 



Table 13: Average mutual information over 10 trials (word clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reuters2 


0.15715 


0.16609 


0.12966 


0.17351 


0.14978 


0.013995 


0.00029807 


Rcuters4 


0.42558 


0.39193 


0.21495 


0.42619 


0.41663 


0.11812 


0.026943 


Reuters6 


0.54112 


0.57472 


0.26971 


0.54239 


0.54828 


0.12460 


0.035309 


Reuters8 


0.63022 


0.63368 


0.29277 


0.64699 


0.65774 


0.15692 


0.0037071 


ReuterslO 


0.70386 


0.73345 


0.33046 


0.66262 


0.68367 


0.025320 


0.029618 


Reuters 12 


0.80111 


0.77959 


0.28412 


0.76128 


0.73517 


0.013483 


0.073478 


Average 


0.54317 


0.54658 


0.25361 


0.53549 


0.53188 


0.075407 


0.028226 
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Table 14: Average entropy over 10 trials (word clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Reutcrs2 


0.76778 


0.75884 


0.79527 


0.75142 


0.77515 


0.91094 


0.92463 


Reuters4 


0.62965 


0.64647 


0.73496 


0.62934 


0.63412 


0.78338 


0.82897 


Reuters6 


0.56184 


0.54884 


0.66683 


0.56134 


0.55906 


0.72297 


0.75751 


Rcuters8 


0.52006 


0.51891 


0.63255 


0.51447 


0.51089 


0.67783 


0.72890 


Reuters 10 


0.50612 


0.49721 


0.61852 


0.51853 


0.51220 


0.71038 


0.70909 


Reuters 12 


0.48211 


0.48811 


0.62632 


0.49322 


0.50050 


0.70181 


0.68507 


Average 


0.57792 


0.57640 


0.67908 


0.57806 


0.58199 


0.75122 


0.77236 



Table 15: Average purity over 10 trials (word clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Rcutcrs2 


0.76987 


0.77082 


0.75378 


0.77730 


0.76021 


0.67006 


0.65988 


Reuters4 


0.64400 


0.62881 


0.60566 


0.64676 


0.64184 


0.55808 


0.53116 


Reuters6 


0.59830 


0.61733 


0.55949 


0.59763 


0.59103 


0.52966 


0.49661 


Reuters8 


0.59560 


0.58935 


0.54296 


0.59179 


0.58770 


0.50933 


0.46499 


Reuters 10 


0.58123 


0.60236 


0.51576 


0.57045 


0.58724 


0.44765 


0.45395 


Reuters 12 


0.60208 


0.59563 


0.49555 


0.58628 


0.56846 


0.43611 


0.44882 


Average 


0.63185 


0.63405 


0.57887 


0.62837 


0.62274 


0.52515 


0.50923 



Table 16: Average Fmeasure over 10 trials (word clustering). 



Data 


LS 


D-U 


D-B 


MU-U 


AU-U 


MU-B 


AU-B 


Rcutcrs2 


0.59287 


0.59471 


0.58733 


0.59696 


0.59427 


0.52628 


0.49976 


Reuters4 


0.46891 


0.43469 


0.36397 


0.48118 


0.46180 


0.32520 


0.27101 


Reuters6 


0.37490 


0.38365 


0.27356 


0.38648 


0.38026 


0.21620 


0.17572 


Reuters8 


0.32488 


0.32674 


0.20820 


0.33527 


0.34251 


0.17127 


0.12565 


Reuters 10 


0.29864 


0.30768 


0.18626 


0.28930 


0.28573 


0.10700 


0.10545 


Reuters 12 


0.29116 


0.29072 


0.14255 


0.27525 


0.27380 


0.088517 


0.095880 


Average 


0.39189 


0.38970 


0.29365 


0.39407 


0.38973 


0.23908 


0.21224 
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matrix multiplications which can be inaccurate numerically, but also we must make sure 
that the stationary point is reachable in a reasonable amount of time. Accordingly, only 
the nonincreasing properties were evaluated which as shown in section [5~2l the converged 
version of our algorithms kept these properties even for large a or /3. 

The maximum allowed ^iterations is an important issue in the multiplicative and 
additive update rules based NMF algorithms since these algorithms are known to be 
slow. As shown in table [SJ the multiplicative and additive update rules based algorithms 
were exceptionally very good at reducing the errors even in the first iterations, but then 
the errors were only slightly reduced for the remaining iterations. This inspired us to use 
20 iterations as the maximum ^iterations. Because this is a rather small number, it is 
very likely that the algorithms stop before reaching a stationary point. 

There are adjustable parameters in our proposed algorithms. These parameters are 
dataset-dependent, and thus should be learned based on the datasets. Because the objec- 
tives of our algorithms resemble the objectives of sparse NMFs, better clustering results 
probably can be achieved by using the same strategy: setting these parameters to small 
numbers. 

There were differences in the running times of the algorithms, but were not significant 
since all algorithms have the same computation complexity: # iterations x M x N x R, 
where M X N is the size of the data matrix, and R is the number of decomposition factors. 

The document clustering results favoured our MU-U algorithm in which it showed 
the best average performances for all used metrics followed closely by LS, AU-U, and 
D-U. MU-U was especially good for small #cluster and LS for large ^clusters. There is 
possibility that because we learned a from Reuters4 dataset, then MU-U performed best 
at the small datasets. But, because adjusting a for each different dataset is rather unfair, 
we believe that these are the best results can be offered by MU-U. All bi-orthogonal 
NMF algorithms, D-B, MU-B, and AU-B, performed rather poorly in these datasets, 
which was unfortunate since there are some works that show D-B is a better clustering 
method compared to LS and D-U [HJ [28] . 

The word clustering results were not as conclusive as the document clustering results 
since there is no a prior label to compare with. Here we used strategy from ref. [11] to 
assign the words to the classes. In this task, D-U offered the best overall performances 
followed closely by LS, MU-U and AU-U. As in the document clustering, all bi-orthogonal 
NMF algorithms also performed poorly in this task. 
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